  1. Nov 2025
    1. Author Response

      Reviewer 1

      Employing in vitro assays and a Drosophila model, the authors interrogate which domain of Hsp27 binds to which region of Tau, and how these interactions affect proteinaceous aggregation. They utilized various biochemical, biophysical, cellular, and genetic tools to dissect the association, and identified the structural basis for the specific recognition of pathogenic p-Tau by Hsp27. Conceivably, Hsp27 may play a role in preventing abnormal Tau aggregation and p-Tau pathology in AD. Overall, the data support the main claim; the biophysical data, especially, are very impressive. Nevertheless, the manuscript could be strengthened by complementary cellular or biochemical methods for validation. For example, the authors could use a stably transfected Tau cell line to interrogate Hsp27's role in cellular Tau aggregation or proteinaceous inclusions by immunoblotting. Immunofluorescent and immunohistochemical staining and immunoblotting (IB) with different antibodies may be conducted to validate the observations.

      REPLY: We sincerely thank the reviewer for the positive assessment of our work and for the very insightful suggestions. We appreciate that the reviewer considers our biophysical data impressive. We fully agree with the reviewer that the work could be strengthened by complementary cellular methods for validation. In our work, we used the Drosophila tauopathy model, in which expression of human TauR406W in the Drosophila nervous system leads to age-dependent neurodegeneration recapitulating some of the salient features of tauopathy in FTDP-17 [1,2], to interrogate the role of Hsp27 in the aggregation and proteinaceous inclusions of pTau.

      In our Drosophila Tau model study, three different antibodies, a total Tau antibody 5A6 [3], a pTauSer262-specific antibody [4], and the hyper-phosphorylated Tau antibody AT8, which recognizes phosphorylation of Tau at the Ser202 and Thr205 sites [5], were used in western blot analysis to explore the role of Hsp27. As shown in Figure R1-1A and B, overexpression of Hsp27 significantly reduced the levels of both pTauSer262 and hyper-phosphorylated Tau at both 2 and 10 days after eclosion (DAE). In addition, we examined the morphology of the fly brain as well as the accumulation of hyper-phosphorylated Tau by immunofluorescence staining. Consistent with previous findings, brains with neuronal expression of TauR406W exhibited an accumulation of filamentous pTau and a reduction of brain neuropil size indicative of neurodegeneration (Figure R1-1C-F). Importantly, overexpression of Hsp27 restored the size of the brain neuropil and suppressed the accumulation of filamentous pTau (Figure R1-1C-F), suggesting that Hsp27 protects against mutant TauR406W-induced neurodegeneration. Taken together, our Drosophila results show that Hsp27 protects against synaptic dysfunction in a Drosophila tauopathy model by reducing pTau aggregation, which supports our biophysical data.

      Figure R1-1 Hsp27 reduces pTau level and protects against pTau-induced synaptopathy in Drosophila. (This figure represents Fig. 2A-F in the revised manuscript) (A) Brain lysates of 2 and 10 days after eclosion (DAE) wild-type (WT) flies (lanes 1 and 6), flies expressing human Tau with GFP (lanes 4 and 9), or human Tau with Hsp27 (lanes 5 and 10) in the nervous system were probed with antibodies for disease-associated phospho-tau epitopes S262, Ser202/Thr205 (AT8), and total Tau (5A6). Actin was probed as a loading control. Brain lysates of flies carrying only UAS elements were loaded for control (lanes 2, 3, 7, and 8). (B) Quantification of protein fold changes in (A). The levels of Tau species were normalized to actin. Fold changes were normalized to the Tau+GFP group at 2 DAE. n = 3. (C) Brains of WT flies or flies expressing Tau+GFP or Tau+Hsp27 in the nervous system at 2 DAE were probed for AT8 (heatmap) and Hsp27 (green), and stained with DAPI (blue). Scale bar, 30 μm. (D-F) Quantification of the Hsp27 intensity (D, data normalized to WT), brain optic lobe size (E), and AT8 intensity (F, data normalized to the Tau+GFP group). n = 4.
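      The two-step normalization described in panel B (each Tau band divided by its actin band, then expressed relative to the Tau+GFP group at 2 DAE) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' analysis script: the function name `fold_changes`, the group labels, and the intensity values are all hypothetical.

```python
def fold_changes(tau, actin, reference):
    """Return actin-normalized Tau levels as fold change versus a reference group.

    tau, actin: dicts mapping a (hypothetical) group label to a band intensity.
    reference: label of the group that is set to 1.0 (e.g. Tau+GFP at 2 DAE).
    """
    # Step 1: correct each Tau band for loading by dividing by its actin band.
    normalized = {group: tau[group] / actin[group] for group in tau}
    # Step 2: express every group relative to the reference group.
    ref_value = normalized[reference]
    return {group: value / ref_value for group, value in normalized.items()}


# Hypothetical densitometry values for a single blot (arbitrary units).
tau_bands = {"Tau+GFP 2DAE": 200.0, "Tau+Hsp27 2DAE": 90.0}
actin_bands = {"Tau+GFP 2DAE": 100.0, "Tau+Hsp27 2DAE": 100.0}
print(fold_changes(tau_bands, actin_bands, reference="Tau+GFP 2DAE"))
```

      With these made-up values, the reference group comes out at 1.0 and the Tau+Hsp27 group at 0.45. Normalizing to actin first means that differences in loading between lanes do not masquerade as differences in Tau levels.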

      Reviewer 2

      Abnormal accumulation and aggregation of amyloid-β protein is one of the main pathological hallmarks of Alzheimer's disease. It is well known that molecular chaperones play central roles in regulating tau function and amyloid assembly in disease. In this manuscript, Zhang, Zhu, Lu, Liu, et al. report that Hsp27, a member of the small heat shock protein family, specifically binds to phosphorylated Tau (pTau), which prevents pTau fibrillation in vitro and in a Drosophila tauopathy model. Using NMR spectroscopy and cross-linking mass spectrometry, the authors found that the N-terminal domain of Hsp27 directly binds to phosphorylation sites of pTau. Overall, the study is important and demonstrates interactions between Hsp27 and pTau.

      REPLY: We sincerely thank the reviewer for the positive remarks on this work, and appreciate that the reviewer summarizes the major conclusions of our manuscript and considers our work important for the fundamental biology of chaperone-client interactions and for its implications in AD pathology.

    1. Author Response

      Reviewer #2 (Public Review):

      Activation of TEAD-dependent transcription by YAP/TAZ has been implicated in the development and progression of a significant number of malignancies. For example, loss-of-function mutations in NF2 or LATS1/2 (known upstream regulators that promote YAP phosphorylation and its retention and degradation in the cytoplasm) promote YAP nuclear entry and association with TEAD to drive oncogenic gene transcription, and such mutations occur in >70% of mesothelioma patients. High levels of nuclear YAP have also been reported in a number of other cancer cell types. As such, the YAP-TEAD complex represents a promising target for drug discovery and therapeutic intervention. Based on the recently reported essential role of TEAD palmitoylation at a conserved cysteine, several groups have successfully targeted this site using both reversible non-covalent TEAD inhibitors (e.g., flufenamic acid (FA), MGH-CP1, compound 2, and VT101~107) and covalent TEAD inhibitors (e.g., TED-347, DC-TEADin02, and K-975), which have been shown to inhibit YAP-TEAD function and display antitumor activity in cells and in vivo.

      Here, Fan et al. disclose the development of covalent TEAD inhibitors and report on the therapeutic potential of this class of agents in the treatment of TEAD-YAP-driven cancers (e.g., malignant pleural mesothelioma (MPM)). Optimized derivatives (MYF-03-69 and MYF-03-176) of a previously reported flufenamic acid-based TEAD inhibitor containing an acrylamide electrophilic warhead (MYF-01-37, Kurppa et al. 2020 Cancer Cell), which display improved biochemical and cell-based potency or improved mouse pharmacokinetic profiles, are described and characterized.

      Strengths:

      All of the authors' claims and conclusions are very well supported and justified by the data provided. Clear improvements in biochemical and cell-based potency have been made within the compound series. Selective cell-based activity in Hippo pathway-defective versus normal/control cell types is established. Transcriptional effects and the regulation of proapoptotic BMF mRNA levels are characterized. A 1.68 Å X-ray co-crystal structure of MYF-03-69 covalently bound to TEAD1 via Cys359 is provided. In vivo efficacy in a relevant xenograft model is demonstrated at a dose of 30 mg/kg BID PO.

      We thank the reviewer for appreciating and highlighting the strengths of our study.

      Weaknesses:

      Beyond the impact on BMF gene regulation, the new biological insights reported here for this compound series are moderate. Progress and differentiation in activity and/or ADME/PK profiles relative to the very closely related, previously described (Kaneda et al. 2020 Am J Cancer Res 10:4399. PMID 33415007) acrylamide-based covalent TEAD inhibitor K-975 (identical 11 nM cell-based potencies when compared head-to-head, and identical reported in vivo efficacy doses of 30 mg/kg) are not entirely clear. Demonstration of on-target in vivo activity is lacking (e.g., impact on BMF gene expression at the evaluated exposure levels).

      We thank the reviewer for this question. We compared the mouse liver microsome stability and hepatocyte stability of K-975 and MYF-03-176 and found that K-975 is metabolically less stable.

      Consistently, when mice bearing NCI-H226 cell-derived xenografts were dosed with 30 mg/kg K-975 twice daily, the tumors kept growing and reached more than 1.5-fold their initial volume by day 14. In contrast, at the same dose, MYF-03-176 induced significant tumor regression. According to Kaneda et al. (2020 Am J Cancer Res 10:4399), K-975 did not reach such efficacy even at 100 or 300 mg/kg twice daily, in either the NCI-H226 or the MSTO-211H CDX mouse model.

      To demonstrate on-target activity in vivo, we measured the expression of TEAD downstream genes and BMF in tumor samples after 3-day BID treatment (pharmacodynamic study) and observed a reduction of CTGF, CYR61, and ANKRD1 and an increase of BMF, indicating on-target activity in vivo.

    1. Author Response

      Reviewer #2 (Public Review):

      This paper by Angueyra, et al., adds to the field’s current understanding of photoreceptor specification and factors regulating opsin expression in vertebrates. Current models of specification of vertebrate photoreceptors are largely based on studies of mammals. However, a great number of animals including teleosts express a wider array of photoreceptor subtypes. Zebrafish for example have 4 distinct cone subtypes and rods. The approach is sound and the data are quite convincing. The only minor weaknesses are that the statistical analyses need to be revisited and the discussion should be a bit more focused.

      To identify differentially expressed transcription factors, the authors performed bulk RNA-seq of pooled, hand-sorted photoreceptors. The selection criteria were tightly controlled to exclude unhealthy cells and cellular debris from other photoreceptor subtypes. Pooling the cells provided considerable sequencing depth, orders of magnitude better than scRNA-seq. The authors identified known transcription factors and several that appear to be novel or whose roles have not been determined. The data, along with a program to access and compare the gene expression data, are available on the PI's website.

      The authors then used CRISPR/Cas9 gene targeting of two known and several novel factors identified in their analysis to test for effects on cell fate decisions and opsin expression. Phenotyping was performed directly on the injected larvae, and the target genes were amplified and sequenced to demonstrate the efficiency of the gene targeting. Targeting of 2 genes with known functions in photoreceptor specification in zebrafish, Tbx2b and Foxq2, resulted in the anticipated changes in cell fate, although the alterations in cell fate in the F0 larvae appear to be weaker than the published phenotypes for the inherited alleles. Interestingly, the authors also identified expression of an RH2 opsin in SWS2 cones, another cone type. The changes are subtle but important.

      The authors then targeted tbx2a, whose function was not known. The result is quite interesting, as it matches the increase of rods and decrease of UV cones observed in tbx2b mutants. However, the injected animals also showed RH2 opsin expression, but now in the LWS cone subtype. These data suggest that Tbx2 transcription factors repress misexpression of opsins in the wrong cell type.

      The authors also show that targeting additional differentially expressed factors does not affect photoreceptor fate or survival in the time frame investigated. These are important data to present. For these or any of the other targeted genes above, did the authors test for changes in photoreceptor number or survival?

      We have attempted to address this point, but the answer is not clear-cut. We used activated caspase-3 immunolabeling as a marker of apoptosis (Lusk and Kwan 2022). At 5 dpf, the age we chose for quantification, we do not see an increase in activated caspase-3-positive cells when we compare control and tbx2a F0 mutants (Reviewer Figure 1A-B). Labeled cells are very rare and located near the ciliary marginal zone irrespective of genotype. This suggests that there is no detectable active cell death at this late stage of development in tbx2a F0 mutants. Earlier in development, at 3 dpf, when photoreceptor subtypes first appear, there is also a normal wave of apoptosis in the retina (Blume et al. 2020; Biehlmaier, Neuhauss, and Kohler 2001), resulting in many cells positive for activated caspase-3; our preliminary quantifications do not show a marked increase in the number of labeled cells in tbx2a F0 mutants, but we think it likely that subtle effects might be obscured by the physiological wave of apoptosis (Reviewer Figure 1C-D).

      Reviewer Figure 1 - Assessment of apoptosis in tbx2a F0 mutants. (A-B) Confocal images of 5 dpf larval eyes of control (A and A’) and tbx2a F0 mutants (B and B’) counterstained with DAPI (grey) and immunolabeled against activated Caspase 3 (yellow) show sparse and dim labeling, restricted to cells located in the ciliary marginal zone, without clear differences between groups. (C-D) Confocal images of 3 dpf larval eyes of control (C and C’) and tbx2a F0 mutants (D and D’) immunolabeled against activated Caspase 3 show many positive cells, located in all retinal layers, as expected from physiological apoptosis at this stage of development and without clear differences between groups.

      Furthermore, the additional single-cell RNA-seq datasets we have reanalyzed suggest that tbx2a and tbx2b are expressed by other retinal neurons and progenitors and not just photoreceptors (Reviewer Figure 2), further confounding attempts at the quantification of apoptosis specifically in photoreceptor progenitors.

      Reviewer Figure 2 - Expression of tbx2 paralogues across retinal cell types. The transcription factors tbx2a and tbx2b are expressed by many retinal cells. Plots show average counts across clusters in the single-cell RNA-seq data obtained by Hoang et al. (2020).

      At this stage, we consider that fully resolving this issue is important and will require considerably more work, which we will pursue in the future using full germline mutants and live-imaging experiments.

      Reviewer #3 (Public Review):

      Angueyra et al. tried to establish a method to identify key factors regulating fate decisions in retinal photoreceptor cells by combining transcriptomic and fast genome-editing approaches. First, they isolated and pooled five subtypes of photoreceptor cells from transgenic lines in each of which a specific photoreceptor subtype is labeled by a fluorescent protein, and then subjected them to RNA-seq analyses. Second, by comparing the transcriptome data, they extracted a list of transcription factor genes enriched in the pooled samples. Third, they applied CRISPR-based F0 knockout to functionally identify transcription factor genes involved in cell fate decisions of photoreceptor subtypes. To benchmark this approach, they initially targeted the foxq2 and nr2e3 genes, which have previously been shown to regulate S-opsin expression and S-cone cell fate (foxq2) and rhodopsin expression and rod fate (nr2e3). They then targeted other transcription factor genes in the candidate list and found that tbx2a and tbx2b are independently required for UV-cone specification. They also found that tbx2a expressed in L-cones and tbx2b expressed in S-cones inhibit M-opsin gene expression in the respective cone subtypes. From these data, the authors concluded that the transcription factors Tbx2a and Tbx2b play a central role in controlling the identity of all photoreceptor subtypes within the retina.

      Overall, the contents of this manuscript are well organized and technically sound. The authors presented convincing data, and carefully analyzed and interpreted them. It includes an evaluation of the presented data on cell-type specific transcriptome by comparing it with previously published ones. I think the current transcriptomic data will be a valuable platform to identify the genes regulating cell-type specific functions, especially in combination with the fast CRISPR-based in vivo screening methods provided here. I hope that the following points would be helpful for the authors to improve the manuscript appropriately.

      1) The manuscript uses the word “FØ” quite often without any proper definition. I wonder how “Ø” should be pronounced - zero or phi? This word is not common and has not been used in previous publications. I feel the phrase “F0 knockout,” which was used in the paper cited by the authors (Kroll et al 2021), is more straightforward. If it is to be used in the manuscript, please define “FØ” and “CRISPR-FØ screening” appropriately, especially in the abstract.

      We have replaced “FØ” with “F0.” Another work we cite (Hoshijima et al., 2019) uses “F0 embryo” throughout. Following our references and Dr Kojima’s suggestion, we adopted “F0 mutant larva” as the most straightforward and least confusing term. We have also made changes in the abstract to define our approach more clearly and made appropriate changes throughout the manuscript.

      2) Figure 1-supplement 1 shows that opn1mw4 has quite high (normalized) FPKM in one of the S-cone samples, in contrast to little or no expression in the M-cone samples, in which opn1mw4 is expected to be detected. The authors should address a possible origin of this inconsistent result for opn1mw4 expression, as well as the technical limitation of using the Tg(opn1mw2:egfp) line for detecting opn1mw4 expression in GFP-positive cells.

      In Figure 1 - Supplement 1, we had attempted to provide a summary figure of all phototransduction genes, but the large differences in expression levels (in particular, the high expression of opsin genes) forced us to use gene-by-gene normalization for display. Without normalization, the expression of opn1mw4 is very low across all samples, and its detection in that sole S-cone sample can likely be attributed to some degree of inherent noise in our methods. We have revised Figure 1 - Supplement 1: we find that we can avoid gene-by-gene normalization and still provide a good summary of the expression of phototransduction genes if the heatmap is broken down by gene families, which have more similar expression levels. In addition, we have added caveats to the use of the Tg(opn1mw2:egfp) line as our sole M-cone marker in the results section describing our RNA-seq approach, including our inability to provide data on Opn1mw4-expressing M cones.

      3) The manuscript lacks a description of the sampling time point. It is well known that many genes are expressed with daily (or circadian) fluctuation (cf. Doherty & Kay, 2010 Annu. Rev. Genet.). For example, the cone-specific gene list in Fig. 2C includes a circadian clock gene, per3, whose expression was reported to fluctuate in a circadian manner in many tissues of zebrafish including the retina (Kaneko et al. 2006 PNAS). It appears to be cone-specific at the time point of sample collection, as shown in Fig. 2, but might be expressed in a different pattern at other time points (e.g., rod expression). The authors should add, at least, a clear description of the sampling time points so as to make their data more informative.

      We have included this information in the Materials and Methods. We collected all our samples during the peak of zebrafish circadian activity, between 11 am and 2 pm (3 h to 6 h after light onset), to avoid the influence of circadian fluctuations in our analysis.

    1. Author Response

      Reviewer #1 (Public Review):

      The authors use a newly developed object-space memory task comprising a "Stable" version and an "Overlapping" version, in which two objects are presented in two locations per trial in a square open field. Each version consists of 5 training trials with 5-min presentations of an object-space configuration; both object locations stay constant across training trials in the Stable condition, whereas only one object location stays fixed in the Overlapping condition. Memory is tested in a test trial 24 hours later, in which the opposite configuration is presented (the overlapping configuration for the Stable condition and the stable configuration for the Overlapping condition), on the premise that memory in this test trial for the Overlapping condition will depend on the accumulated memory of spatial patterns over the training trials, whereas memory in the test trial for the Stable condition can be due to episodic memory of the last trial or to accumulated memory. Memory is quantified using a Discrimination Index (DI), comparing the amount of time animals spend exploring the two object locations.

      Here, animals in other groups are also presented with an interference trial equivalent to the test trial, to test if the memory of the Overlapping condition can be disrupted. The behavioral data show that for RGS14 over-expressing animals, memory in the Overlapping condition is diminished compared to controls with no interference or controls where over-expression is inhibited, whereas memory in the Stable condition is enhanced. This is interpreted as interference in semantic-like memory formation, whereas one-shot episodic memory is improved. The authors speculate that increased cortical plasticity should lead to increased and larger delta waves according to the sleep homeostasis hypothesis, and observe that instead increased cortical plasticity leads to less non-REM sleep and smaller delta waves, with more prefrontal neurons with slower firing rates (presumably more plastic neurons). They further report increased hippocampal-cortical theta coherence during task and REM sleep, increased NonREM oscillatory coupling, and changes in hippocampal ripples in RGS14 over-expressing animals.

      While these results are interesting, there are several issues that need to be addressed, and the link between physiology and behavioral results is unclear.

      1) The behavioral results rely on the interpretation that the Overlapping condition corresponds to semantic-like memory and the Stable condition corresponds to episodic-like memory. While the dissociation in memory performance due to interference seen in these two conditions is intriguing, the Stable condition can correspond not just to memory of the previous trial but also to accumulated memory of a stable spatial pattern over the 5 training trials, similar to the accumulated memory of a changing spatial pattern in the Overlapping condition.

      Yes! We completely agree on this. We do not claim that the Stable condition corresponds to episodic-like memory; instead, we refer to it as simple memory, since it can be solved either way (one-trial memory or cumulative memory). We have now expanded on this in the discussion to make it clearer.

      Here, it is puzzling that in the behavioral control with no interference (Figure 1D), memory in the Stable and Overlapping conditions is unchanged in the test trial, with the DI statistically at 0. In the original description of the Object Space task by the authors in the referenced paper, the measure of memory was a Discrimination Index significantly higher than 0 in both the Stable and Overlapping conditions. This discrepancy needs to be reconciled. Is the DI for the interference trial shown in Fig. S1 significantly different from 0? No statistics or description is provided in the figure legend here.

      As mentioned above, we apologize for oversimplifying the description. The 24h interference trial corresponds to the original test trial. We added a clarifying figure for comparison in Fig. S1 (a bar graph in addition to the violin plot) together with statistics. Performance was above chance for all groups and conditions, replicating our previous results.

      2) The physiology experiments compare Home cage (HC) conditions to the Object Space task (OS) throughout the manuscript. While some differences are seen in the control and RGS14 over-expressing animals, there is no comparison of the Stable vs. Overlapping condition in the physiology experiments. This precludes making explicit links between physiological observations and behavioral effects.

      As also mentioned above, we have now added an analysis exploring the individual OS conditions. We thank the reviewers for giving us the opportunity to do so.

      3) The authors speculate that learning will result in larger and more delta waves as per the synaptic homeostasis hypothesis. It should be noted here that an alternative hypothesis is that there should also be a selective increase in synaptic plasticity for learning and consolidation. The authors do observe that control animals show more frequent and higher-amplitude delta waves, but rather than enhancing this process, RGS14 animals with increased plasticity show the opposite effect. How can this be reconciled and linked with the behavioral data in the Stable and Overlapping condition?

      In the context of the Object Space Task, we would expect all behavioural conditions (Stable and Overlapping) to induce synaptic changes, since learning also occurs in the Stable condition (see also performance on the 24h trial). Thus, we would expect homeostatic responses, such as an increase in delta amplitude, for all experiences, regardless of whether subtle statistical rules are presented. In contrast, the Active Systems Consolidation hypothesis of sleep proposes that more detailed processing, such as extracting underlying regularities, occurs during hippocampal-cortical interactions in the form of delta/ripple/spindle coupling (with different theories emphasising different types of interactions). As mentioned above, we have now added a more specific analysis in this regard, showing that the two OS conditions that involve moving objects (in which statistical regularities can therefore potentially be extracted) have a higher percentage of ripples occurring after large slow oscillations than the home cage or the simple learning condition, Stable. In contrast, RGS14 animals already show higher participation in both control conditions, emphasising that in these animals all experiences are treated by the brain as significant learning conditions; this explains the behavioural effect (increased interference due to better memory for the interference trial). Further, we have expanded the discussion of how, in RGS14 animals, we sometimes see an enhancement of learning effects and sometimes a more complex interaction with what we would expect from physiological learning.

      Similarly, there is an increase in slower-firing neurons in RGS14 over-expressing animals. Slower-firing neurons have been proposed to be more plastic in the hippocampus based on their participation in learned hippocampal sequences, but appropriate references or data are needed to support the assertion that slower-firing neurons in the prefrontal cortex are more plastic.

      As described above, we have expanded the discussion, including other citations that also consider the cortex. We show that our changes would be expected if the cortex were made as plastic as the hippocampus.

      4) It is noted that changing cortical plasticity influences hippocampal-cortical coupling and hippocampal ripples, suggesting a cortical influence on hippocampal physiological patterns. It has been previously shown that disrupting prefrontal cortical activity does alter hippocampal ripples and hippocampal theta sequences (Schmidt et al., 2019; Schmidt and Redish, 2021). The current results should be discussed in this context.

      We thank the reviewer for these suggestions, which are now incorporated in the manuscript.

      Reviewer #2 (Public Review):

      In this paper, the authors provide evidence to support the longstanding proposition that a dual-learning system/systems-level consolidation (hippocampus attains memories at a fast pace which are eventually transmitted to the slow-learning neocortex) allows rapid acquisition of new memories while protecting pre-existing memories. The authors leverage many techniques (behavior, pharmacology, electrophysiology, modelling) and report a host of behavioral and electrophysiological changes on induction of increased medial prefrontal cortex (mPFC) plasticity which are interesting and will be of significant interest to the broad readership.

      The experimental design and analyses are convincing (barring some instances which are discussed below). The following recommendations will bolster the strength/quality of the manuscript:

      1) Certain concerns regarding the interpretation and analysis of the behavioral data remain. The authors need to clarify if increased mPFC plasticity leads to only an increase in one-shot memory or 'also' interference of previous information. It seems that the behavioral results could also be explained by the more parsimonious explanation that one-shot memory is improved. Do the current controls tease apart these two scenarios?

      We agree that we cannot disentangle whether one memory is simply stronger than the other or whether it is an overwriting effect. We have now added this to the discussion. Of note, we do not think it would actually be possible to distinguish these two effects behaviourally in rodents; at least, we cannot think of a study design that would enable this contrast.

      Additionally, the authors need to clarify why the 'no trial' and 'anisomycin' controls for the stable task perform at chance levels on exposure to a new object-place association on test day (Fig 1D).

      Violin plots can be hard to read. Here we provide simple bar plots, which show that the animals are not at chance at the 72h test in the control conditions.

      Finally, a further description of how the discrimination index (exploration time of novel minus exploration time of familiar, divided by the sum of both) is computed is recommended, i.e., in the Stable condition, which 'object' is chosen as 'novel' (as both are in the same locations) for computing the index (Fig 1). Do negative DI values imply neophobia toward novel objects (and thus reflect a form of memory)? This is also crucial because the modelling results (Fig 1E) use both neophilia and neophobia, while negative discrimination indexes are considered similar to 0 for interpreting the behavioral results, as stated on page 3, lines 84-86.

      We added this now to the methods (For Overlapping it is moved location – stable location, for Stable it is location-to-be-moved-at-test – stable location and for random which is assigned as moved and stable is random, and then for each divided by total time). We agree that neophilia/neophobia (especially changes in the distribution) can be an issue and have discussed it in detail in Schut et al NLM 2020 where we see difference in absolute beta values (thus controlling for philia/phobia differences). We also discuss there why it is difficult to control for this in the DI in more detail. In short, one could use absolute values but then it is difficult to determine what a group chance-level would look like. However, luckily here there is not issue since we did not observe difference in neophilic or phobic tendencies while running the experiments. Critically the interference trial (that can also function as simple test trial) confirms that as a group animals show positive DI and neophilia.

      2) The authors report lower firing rates in RGS14414 animals during the task in Fig 2F. It is indeed remarkable how large the reported differences are. The authors need to rule out any differences in the behavioral state of the animals in the two groups during the task, i.e., rest vs. active exploration/movement dynamics. Are only epochs during the task while the animals interact with the objects used for computing the firing rates (same epochs as Fig 1)? If not, doing so will provide a useful comparison with Fig 1. Additionally, although the authors make the case for slow firing rate neurons being important for plasticity (based on Grosmark and Buzsaki, 2016), it is crucial to note that the firing rate dynamic (slow vs. fast) in that study for the hippocampus is defined based on the whole recorded session (predominated by sleep), indeed the firing rates of the two groups (slow vs. fast/plastic vs. rigid) during the task/maze-running do not differ in that study. Therefore, the results here seem incongruent with the Grosmark and Buzsaki paper. Since this finding is central to the main claim of the authors, it either warrants further investigation or a re-interpretation of their results.

      As mentioned in the main points, we have now added the firing rate analysis (including new group splits) for wake in the sleep box, NREM and REM separately. Each time, the same results are obtained. We do not yet have a tracking and video synchronization set-up, so we cannot split the task into specific behaviours.
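      The per-state analysis mentioned above could look roughly like the following sketch: a neuron's rate in a state is its spike count inside that state's epochs divided by their total duration, and neurons are grouped into slow and fast by their whole-session rate (the definition the reviewer attributes to Grosmark and Buzsaki). The median split and all function names are assumptions for illustration; the reply does not specify how the new group splits were made.

```python
from statistics import median


def rate_in_epochs(spike_times, epochs):
    """Mean firing rate (Hz): spikes inside the epochs / total epoch duration.

    spike_times: spike times in seconds.
    epochs: list of (start, end) intervals in seconds for one state,
            e.g. all NREM bouts; intervals are assumed not to overlap.
    """
    duration = sum(end - start for start, end in epochs)
    if duration <= 0:
        raise ValueError("epochs must have positive total duration")
    n_spikes = sum(
        1 for t in spike_times if any(start <= t < end for start, end in epochs)
    )
    return n_spikes / duration


def rates_by_state(spike_times, state_epochs):
    """Firing rate of one neuron in each state, e.g. {'wake': ..., 'NREM': ...}."""
    return {state: rate_in_epochs(spike_times, ep) for state, ep in state_epochs.items()}


def split_slow_fast(session_rates):
    """Hypothetical median split on whole-session rate -> (slow_ids, fast_ids)."""
    cutoff = median(session_rates.values())
    slow = sorted(n for n, r in session_rates.items() if r <= cutoff)
    fast = sorted(n for n, r in session_rates.items() if r > cutoff)
    return slow, fast
```

      Grouping once on the whole session and then comparing the same groups across wake, NREM and REM keeps group membership fixed, so any difference between states reflects a shift in rates rather than a reshuffling of neurons.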

      However, we now also cite Buzsaki's original log-normal brain review, where he first proposed the idea. There he also shows the same effects as we do: the general firing rate distribution is the same for the task and the different sleep stages, just shifted overall. The analysis by Grosmark included a more stringent subselection of neurons in order to also argue that incorporation into run/replay sequences could not have been biased by firing rate per se (instead of plasticity). However, Buzsaki's original proposition does fit our results. He further presents hippocampus vs cortex firing rates, which also confirm the idea (the hippocampus is more plastic and has slower firing rates). We included this figure above in the general comments. Further, we have now expanded the discussion on this point.

      3) A concern remains as to how many of the electrophysiological changes they observe (firing rate differences, LFP differences including coupling, sleep state differences, Figs. 2-4) support their main hypothesis or are a by-product of injection of RGS14414 (for instance, one might argue that an increased 'capability' to learn new information/more plasticity might lead to more NREM sleep for consolidation, etc.). The authors need to carefully interpret all their data in light of their main hypothesis, which will substantially improve the quality/strength of the manuscript.

      We have now expanded the discussion, given it more structure, and state that we cannot disentangle whether the cellular changes, the sleep oscillation changes, or an interaction of both cause the result. Furthermore, we added that we cannot distinguish whether the interference memory is stronger or actually overwrites the original training memory.

      Reviewer #3 (Public Review):

      The authors set out to test the idea that memories involve a fast process (for the acquisition of new information) and a slow process (where these memories are progressively transferred/integrated into more-long term storage). The former process involves the hippocampus and the latter the cerebral cortex. This 'dual-learning' system theoretically allows for new learning without causing interference in the consolidation of older memories. They test this idea by artificially increasing plasticity in the pre-limbic cortex and measuring changes in different learning/memory tasks. They also examined electrophysiological changes in sleep, as sleep is linked to memory formation and synaptic plasticity.

      The strengths of the study include a) meticulous analyses of a variety of electrophysiological measurements b) a combination of neurobiological and computational tools c) a largely comprehensive analysis of sleep-based changes. Some weaknesses include questions about the technique for increasing cortical plasticity (is this physiological?) and the absence of some additional experiments that would strengthen the conclusions. However, overall, the findings appear to support the general idea under examination.

      This study is likely to be very impactful as it provides some really new information about these important neural processes, as well as data that challenges popular ideas about sleep and synaptic plasticity.

      We would like to thank the reviewer for these positive comments. Answers to the weaknesses are presented below in the recommendations for the authors.

    1. Author Response

      Reviewer 1 (Public Review):

      To me, the strengths of the paper are predominantly in the experimental work, there's a huge amount of data generated through mutagenesis, screening, and DMS. This is likely to constitute a valuable dataset for future work.

      We are grateful to the reviewer for their generous comment.

      Scientifically, I think what is perhaps missing, and I don't want this to be misconstrued as a request for additional work, is a deeper analysis of the structural and dynamic molecular basis for the observations. In some ways, the ML is used to replace this and I think it doesn't do as good a job. It is clear for example that there are common mechanisms underpinning the allostery between these proteins, but they are left hanging to some degree. It should be possible to work out what these are with further biophysical analysis…. Actually testing that hypothesis experimentally/computationally would be nice (rather than relying on inference from ML).

      We agree with the reviewer that this study should motivate a deeper biophysical analysis of molecular mechanisms. However, in our view, the ML portion of our work was not intended as a replacement for mechanistic analysis, nor could it serve as one. We treated ML as a hypothesis-generating tool. We hypothesized that distant homologs are likely to have similar allosteric mechanisms which may not be evident from visual analysis of DMS maps. We used ML to (a) extract underlying similarities between homologs (b) make cross predictions across homologs. In fact, the chief conclusion of our work is that while common patterns exist across homologs, the molecular details differ. ML provides tantalizing evidence to this effect. The conclusive evidence will require, as the reviewer rightly suggests, detailed experimental or molecular dynamics characterization. Along this line, we note that we have recently reported our atomistic MD analysis of allostery hotspots in TetR (JACS, 2022, 144, 10870). See ref. 41.

      Changes to manuscript: “Detailed biophysical or molecular dynamics characterization will be required to further validate our conclusions (38).”

      Reviewer 3 (Public Review):

      However - at least in the manuscript's present form - the paper suffers from key conceptual difficulties and a lack of rigor in data analysis that substantially limits one's confidence in the authors' interpretations.

      We hope the responses below address and allay the reviewer’s concerns.

      A key conceptual challenge shaping the interpretation of this work lies in the definition of allostery, and allosteric hotspot. The authors define allosteric mutations as those that abrogate the response of a given aTF to a small molecule effector (inducer). Thus, the results focus on mutations that are "allosterically dead". However, this assay would seem to miss other types of allosteric mutations: for example, mutations that enhance the allosteric response to ligand would not be captured, and neither would mutations that more subtly tune the dynamic range between uninduced ("off") and induced ("on") states (without wholesale breaking the observed allostery). Prior work has even indicated the presence of TetR mutations that reverse the activity of the effector, causing it to act as a co-repressor rather than an inducer (Scholz et al (2004) PMID: 15255892). Because the work focuses only on allosterically dead mutations, it is unclear how the outcome of the experiments would change if a broader (and in our view more complete) definition of allostery were considered.

      We agree with the reviewer that mutations that impact allostery manifest in many different ways. Furthermore, the effect size of these mutations runs the full gamut from subtle changes in dynamic range to drastic reversal of function. To unpack allostery further, allostery of aTF can be described, not just by the dynamic range, but by the actual basal and induced expression levels of the reporter, EC50 and Hill coefficient. Given the systemic nature of allostery, a substantial fraction of aTF mutations may have some subtle impact on one or more of these metrics. To take the reviewer’s argument one step further, one would have to accurately quantify the effect size of every single amino acid mutation on all the above properties to have a comprehensive sequence-function landscape of allostery. Needless to say, this is extremely hard! Resolution of small effect sizes is very difficult, even at high sequencing depth. To the best of our knowledge, a heroic effort approaching such comprehensive analysis has been accomplished so far only once (PMID: 3491352).

      Our focus, therefore, was to screen for the strongest phenotypic impact on allostery i.e., loss of function. Mutations leading to loss of function can be relatively easily identified by cell-sorting. Because our goal was to compare hotspots across homologs, we surmised that loss of function mutations, given their strong phenotypic impact, are likely to provide the clearest evidence of whether allosteric hotspots are conserved across remote homologs.

      The reviewer raised the point of activity-reversing mutations. Yes, there are activity-reversing mutations in TetR. However, they represent an insignificant fraction. In the paper cited by the reviewer, there are 15 activity-reversing mutations among 4000 screened. Furthermore, the paper shows that activity reversal in TetR requires two to four mutations, while our library consists exclusively of single amino acid substitutions. For these reasons, we did not screen for activity-reversing mutations. Nonetheless, we agree with the reviewer that screening for activity-reversing mutations across homologs would be very interesting.

      The separation in fluorescence between the uninduced and induced states (the assay dynamic range, or fold induction) varies substantially amongst the four aTF homologs. Most concerningly, the fluorescence distributions for the uninduced and induced populations of the RolR single mutant library overlap almost completely (Figure 1, supplement 1), making it unclear if the authors can truly detect meaningful variation in regulation for this homolog.

      Yes, the reviewer is correct that the fold induction ratio varies among the four aTF homologs. However, we note that such differences are common among natural aTFs. Depending on the native downstream gene regulated by the aTF, some aTFs show higher ligand-induced activation and others lower. While this is not a hard and fast rule, aTFs that regulate efflux pumps tend to have higher fold induction than those that regulate metabolic enzymes. In summary, the variation in fold induction among the four aTFs is not a flaw in experimental design, nor does it indicate experimental inconsistency; it is instead an inherent property of the protein-DNA interaction strength and the allosteric response of each aTF.

      Among the four aTFs, wildtype RolR has the weakest fold induction (15-fold), which makes sorting the RolR library particularly challenging. To minimize false positives as much as possible, we require that a dead mutant be present in (a) non-fluorescent cells after ligand induction, (b) non-fluorescent cells before ligand induction, and (c) at least two out of the three replicates for both sorts. Additionally, for RolR specifically, we set the non-fluorescent gate to be far more stringent than for the other three aTFs (Fig. 1 – figure supplement 1). Furthermore, we assign residues, not individual dead mutations, as allosteric hotspots. This buffers against false strong signals from stray individual dead mutations. Finally, selecting the top interquartile range winnows the set to residues showing a strong, consistent dead phenotype. As a result of these built-in safeguards, the number of allosteric hotspots of RolR (57) is comparable to that of the other three aTFs (51, 53 and 48). This suggests that we are not overestimating the number of hotspots despite the weaker fold induction of RolR. We show in a new supplementary figure (Figure 1 – figure supplement 4) that changing the read count threshold from 5X to 10X produces nearly identical patterns of mutations, suggesting that our results are also robust to changes in read depth stringency.

      Changes to manuscript: In response to the reviewer's comment, we have added the following sentence.

      “We note that the lower fold induction (dynamic range) of RolR makes it particularly challenging to separate the dead variants from the rest.”

      The methods state that "variants with at least 5 reads in both the presence and absence of ligand in at least two replicates were identified as dead". However, the use of a single threshold (5 reads) to define allosterically dead mutations across all mutations in all four homologs overlooks several important factors:

      Depending on the starting number of reads for a given mutation in the population (which may differ in orders of magnitude), the observation of 5 reads in the gated nonfluorescent region might be highly significant, or not significant at all. Often this is handled by considering a relative enrichment (say in the induced vs uninduced population) rather than a flat threshold across all variants.

      We regret the lack of clarity in our presentation. We wish to better explain the rationale behind our approach. First, we understand the reviewer’s point on considering relative enrichment to define a threshold. This approach works well in DMS experiments involving genetic selections, which is commonly the case, because activity scales well with selection stringency. One can then pick enrichment/depletion relative to the middle of the read count distribution as a measure of gain or loss of function.

      Second, this strategy does not, in practice, work well for cell-sorting screens. While it may be tempting to think of cell sorting as comparably activity-scaled as genetic selections, in reality, the fidelity of fluorescence-activated cell sorters is much lower. Making quantitative claims of activity based on cell sorting enrichment can be risky. It is wiser to treat cell sorting results as a yes/no binary, i.e., does the mutation disrupt allostery or not. More importantly, the yes/no binary classification suffices for our need to identify whether a certain mutation adversely impacts allosteric activity.

      Third, the above argument does not imply that all mutations have the same effect size on allostery. They don’t. We capture the effect size on individual residues, not individual mutations, by counting the number of dead mutations at a residue position. This is an important consideration because it safeguards us from minor inconsistencies that inevitably arise from cell sorting.

      Fourth, for a variant to be classified as allosterically dead, it must be present in both the uninduced and induced DNA-bound populations in at least two out of three replicates (four conditions total). This stringent criterion for selecting dead variants results in highly consistent regions of importance in the protein, even when the read count threshold is varied. To the extent possible, we have minimized the possibility of false positive bleed-through.
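      As a rough sketch of the classification rule quoted from the methods (at least 5 reads with and without ligand in at least two of three replicates), followed by the residue-level count of dead mutations described in the third point. We assume that a replicate counts only if the variant passes the threshold in both sorts of that replicate; the function names and the variant label format (e.g. `A42V`) are hypothetical.

```python
from collections import Counter


def is_dead(reads_uninduced, reads_induced, min_reads=5, min_replicates=2):
    """Classify one variant as allosterically dead.

    reads_uninduced / reads_induced: normalized read counts of the variant in the
    non-fluorescent gate, one value per replicate, without / with inducer.
    A replicate counts only if the variant reaches min_reads in both sorts
    (an assumption about how the two sorts are combined).
    """
    if len(reads_uninduced) != len(reads_induced):
        raise ValueError("need one count per replicate for each sort")
    passing = sum(
        1 for u, i in zip(reads_uninduced, reads_induced)
        if u >= min_reads and i >= min_reads
    )
    return passing >= min_replicates


def dead_count_per_residue(dead_variants):
    """Number of dead mutations at each residue position.

    Variant labels are assumed to look like 'A42V' (wild-type residue,
    position, substituted residue); this format is hypothetical.
    """
    return Counter(int(label[1:-1]) for label in dead_variants)
```

      Calling `is_dead(..., min_reads=10)` corresponds to the stricter 10x threshold compared in Figure 1 – figure supplement 4; counting dead mutations per residue, rather than scoring each variant's enrichment, is what makes the residue-level call tolerant of stray individual mutations.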

      Finally, two separate normalizations were performed on the total sequence reads so that a common read count threshold could be applied 1) between experimental conditions and replicates and 2) across proteins. First, total sequencing reads were normalized to 200k total across all sample conditions (presorted, -inducer, and +inducer) and replicates for each homolog, allowing comparisons within a single protein. Next, reads were normalized again to account for differences in the theoretical size of each protein’s single-mutant library, allowing comparisons across proteins with a common read count cutoff. For example, total sequencing reads of RolR (4,332 possible mutants) were increased by 1.18x relative to MphR (3,667 possible mutants), for a total of 236k reads.
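      The two-step normalization can be sketched as follows, assuming MphR's library serves as the 200k reference, as the worked example implies. The function names are hypothetical and the sketch only reproduces the arithmetic described above.

```python
REFERENCE_TOTAL = 200_000  # reads per sample after the first normalization


def normalize_sample(counts, target_total=REFERENCE_TOTAL):
    """Rescale one sample's variant counts so that they sum to target_total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    scale = target_total / total
    return {variant: n * scale for variant, n in counts.items()}


def library_scaled_total(n_possible_mutants, n_reference_mutants,
                         reference_total=REFERENCE_TOTAL):
    """Second normalization: scale the read budget by theoretical library size."""
    return reference_total * n_possible_mutants / n_reference_mutants
```

      For RolR, `library_scaled_total(4332, 3667)` gives about 236,269 reads (a factor of about 1.18), consistent with the ~236k quoted above; passing that value as `target_total` to `normalize_sample` lets the same read count cutoff mean roughly the same per-variant coverage in a larger library.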

      Changes to manuscript: We have provided substantial additional details in the Fluorescence-activated cell sorting and NGS preparation and analysis sections.

      We also added the following in the main text.

      “In other words, we use cell sorting as a binary classifier i.e., does the mutation disrupt allostery or not. We capture the effect size on individual residues, not individual mutations, by counting the number of dead mutations at a residue position. This is an important consideration because it safeguards us from minor inconsistencies that inevitably arise from cell sorting.”

      Depending on the noise in the data (as captured in the nucleotide-specific q-scores) and the number of nucleotides changed relative to the WT (anywhere between 1-3 for a given amino acid mutation) one might have more or less chance of observing five reads for a given mutation simply due to sequencing noise.

      All the reads considered in our analyses pass the Illumina quality threshold of Q-score ≥ 30, which according to Illumina represents “perfect reads with no errors or ambiguities”. This translates into a probability of 1 in 1000 of an incorrect base call, or 99.9% base call accuracy.
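      The Q-score is a Phred-scaled error probability, Q = −10 log10(p), so Q30 corresponds to p = 10^−3. A minimal conversion sketch (function names are illustrative):

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred quality score for a base-call error probability p (0 < p <= 1)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)
```

      `phred_to_error_probability(30)` gives 0.001, i.e. the 99.9% per-base accuracy stated above.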

      We use chip-based oligonucleotides to build our DMS library, which allows us to prespecify the exact codon that encodes each point mutation. This means that each amino acid substitution corresponds to exactly one nucleotide variant, so nucleotide-level and protein-level variant counts are the same. The scenario referred to by the reviewer, i.e., “anywhere between 1-3 for a given amino acid mutation”, only applies to codon-randomized or error-prone PCR library generation. We regret if the chip-based library assembly was unclear.

      Depending on the shape and separation of the induced (fluorescent) and uninduced (non-fluorescent) population distributions, one might have more or less chance of observing five reads by chance in the gated non-fluorescent region. The current single threshold does not account for variation in the dynamic range of the assay across homologs.

      We have addressed the concern raised by the reviewer on fluorescent population distributions in answers to questions 10 and 11.

      The reviewer makes an important point about the choice of sequencing threshold. We use the sequencing threshold simply to make a binary call on whether a certain variant exists in the sorted population or not. We do not use the sequencing reads to scale the activity of the variant. To address the reviewer's comment, we have included a new supplementary figure (Fig 1 – figure supplement 4) in which we compare the data at two read count thresholds: 5 and 10 reads. As is evident in the new figure, the fundamental pattern of allosteric hotspots and the overall data interpretation do not change.

      TetR: 5x – 53 hotspots, 10x – 51 hotspots

      TtgR: 5x – 51 hotspots, 10x – 51 hotspots

      MphR: 5x – 48 hotspots, 10x – 48 hotspots

      RolR: 5x – 57 hotspots, 10x – 60 hotspots

      In other words, changing the threshold to be more or less strict may have a modest impact on the overall number of hotspots in the dataset. Still, the regions of functional importance are consistent across different thresholds. We have expanded the discussion in the manuscript to address this point.

      Changes to manuscript: We have now included a new supplementary figure comparing hotspot data at two thresholds: Figure 1 – figure supplement 4.

      We also added the following in the main text.

      “To assess the robustness of our classification of hotspots, we determined the number of hotspots at two different sequencing thresholds – 5x and 10x. At 5x and 10x, the numbers of hotspots are – TetR: 53, 51; TtgR: 51, 51; MphR: 48, 48 and RolR: 57, 60, respectively. Changing the threshold has a modest impact on the overall number of hotspots, and the regions of functional importance are consistent at both thresholds.”

      The authors provide a brief written description of the "weighted score" used to define allosteric hotspots (see y-axis for figure 1B), but without an equation, it is not clear what was calculated. Nonetheless, understanding this weighted score seems central to their definition of allosteric hotspots.

      We regret the lack of clarity in our presentation. The weighted score was used to quantify the “deadness” of every residue position in the protein. At each position in the protein, the number of mutations that inhibited activity was summed, and the ‘deadness’ of each mutation was weighted by the number of replicates in which it appeared to inactivate the protein. The weighted score at each residue position is given by

      Where at position x in the protein, D1 is the number of mutations dead in one replicate only, D2 is the number of mutations dead in 2 replicates, D3 is the number of mutations dead in 3 replicates, and Total is the total number of variants present in the data set (based on sequencing data). Any dead mutation that is seen in only one replicate is discarded and does not contribute to the “deadness” of the residue. Mutations seen in two and three replicates contribute to the score. We have included a new supplementary figure (Fig. 1 – figure supplement 2) to give the reader a detailed heatmap of all mutations and their impact for each protein.

      Changes to manuscript: The weighted scoring scheme is now described in greater detail under Materials and Methods in the “NGS preparation and analysis” section.

      The authors do not provide some of the standard "controls" often used to assess deep mutational scanning data. For example, one might expect that synonymous mutations are not categorized as allosterically dead using their methods (because they should still respond to ligand) and that most nonsense mutations are also not allosterically dead (because they should no longer repress GFP under either condition). In general, it is not clear how the authors validated the assay/confirmed that it is giving the expected results.

      As we state in response to question 12, we use chip-based oligonucleotides to build our DMS library, which allows us to pre-specify the exact codon that encodes each point mutation. Our DMS library therefore contains no synonymous or nonsense mutations. Each protein mutation is encoded by a single unique codon. The only stop codon is at the 3′ end of the gene.

      The authors performed three replicates of the experiment, but reproducibility across replicates and noise in the assay is not presented/discussed.

      Changes to manuscript: A new supplementary table (Table 1) is now provided with the pairwise correlation coefficients between all replicates for each protein.

      In the analysis of long-range interactions, the authors assert that "hotspot interactions are more likely to be long-range than those of non-hotspots", but this was not accompanied by a statistical test (Figure 2 - figure supplement 1).

      In response to the reviewer's comment, we now include a paired t-test comparing nonhotspots and hotspots with long-range interactions in the main text.

      Changes to manuscript: In all four aTFs, hotspots constituted a higher fraction of LRIs than non-hotspots (Figure 2 – figure supplement 1; P = 0.07).
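      For reference, a paired t-test on matched per-aTF values reduces to a one-sample t-test on the differences. The sketch below assumes each of the four aTFs contributes one pair (one long-range-interaction value for its hotspots, one for its non-hotspots), giving n − 1 = 3 degrees of freedom; the p-value would then come from a t distribution, for example via `scipy.stats.ttest_rel`. The function name is illustrative.

```python
import math
from statistics import mean, stdev


def paired_t(x, y):
    """Paired t statistic and degrees of freedom for matched samples x and y.

    Equivalent to a one-sample t-test of the differences x[i] - y[i] against 0.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two matched pairs")
    diffs = [a - b for a, b in zip(x, y)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

      With only four pairs, the test has little power, so a consistent direction of effect across all four aTFs can still yield a p-value above 0.05.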

    1. Author Response

      Reviewer #1 (Public Review):

      In this study, the authors describe an elegant genetic screen for mutants that suppress defects of MCT1 deletions, which are deficient in mitochondrial fatty acid synthesis. This screen identified many genes, including that for Sit4. In addition, genes for retrograde signaling factors (Rtg1, Rtg2 and Rtg3), proteins influencing proteasomal degradation (Rpn4, Ubc4) or ribosomal proteins (Rps17A, Rps29A) were found. From this mix of components, the authors selected Sit4 for further analysis. In the first part of the study, they analyzed the effect of Sit4 in the context of MCT1 mutant suppression. This more specific part is very detailed and thorough; the experiments are well controlled and convincing. The second, more general part of the study focused on the effect of Sit4 on the level of the mitochondrial membrane potential. This part is of high general interest, but less well developed. Nevertheless, this study is very interesting as it shows for the first time that phosphate export from mitochondria is of general relevance for the membrane potential even in wild type cells (as long as they live from fermentation), that the Sit4 phosphatase is critical for this process, and that the modulation of Sit4 activity influences processes relying on the membrane potential, such as the import of proteins into mitochondria. However, some aspects should be further clarified.

      1) It is not clear whether Sit4 is only relevant under fermentative conditions. Does Sit4 also influence the membrane potential in respiring cells? Fig. S2D shows the membrane potential in glucose and raffinose. Both carbon sources lead to fermentative growth. The authors should also test whether Sit4 levels influence the membrane potential when cells are grown under respiratory conditions, such as in ethanol, lactate or glycerol. Even if deletions of Sit4 affect respiration, mutants with altered activity can be easily analyzed.

      sit4Δ cells fail to grow on nonfermentable media, as shown by us (Figure 2—figure supplement 1C) and others (Arndt et al., 1989; Dimmer et al., 2002; Jablonka et al., 2006). In our opinion, the exact reason is unclear, but interestingly, the addition of aspartate can partially restore growth on ethanol (Jablonka et al., 2006). Although this sit4Δ defect has not been thoroughly investigated, an early study speculated that it could be related to the cAMP-PKA pathway and pointed out genetic interactions of SIT4 with multiple genes in this pathway (Sutton et al., 1991). In addition, sit4Δ cells share phenotypes with cAMP-PKA null mutants, such as glycogen accumulation, caffeine resistance, and failure to grow on nonfermentable media (Sutton et al., 1991). Based on a literature search, we have not found reports of sit4Δ mutants that can grow on nonfermentable media.

      2) The authors should give a name to the pathway shown in Fig. 4D. This would make it easier to follow the text in the results and the discussion. This pathway was proposed and characterized in the 90s by George Clark-Walker and others, but never carefully studied on a mechanistic level. Even if the flux through this pathway cannot be measured in this study, the regulatory role of Sit4 for this process is the most important aspect of this manuscript.

      We now refer to this mechanism as the mitochondrial ATP hydrolysis pathway.

      3) To further support their hypothesis, the authors should show that deletion of Pic1 or Atp1 wipes out the effect of a Sit4 deletion. In these petite-negative mutants, the phosphate export cycle cannot be carried out and thus, Sit4, should have no effect.

      The mitochondrial phosphate transport activity is electroneutral, as it co-transports a proton together with inorganic phosphate. The F1 subunit of the ATP synthase (Atp1 and Atp2) has been suggested in many studies to be responsible for ATP hydrolysis. We performed tetrad dissection to generate atp1Δ or atp2Δ in the pho85Δ background. After streaking single colonies onto fresh plates, we noticed that atp1Δ mct1Δ and atp2Δ mct1Δ cells are lethal, and that knocking out PHO85 rescued this synthetic lethality. It is not surprising that atp1Δ mct1Δ or atp2Δ mct1Δ cells are lethal, since the F1 subunit is important for generating a minimal MMP in mct1Δ cells when the ETC is absent (i.e., rho0 cells). However, knocking out PHO85 can generate MMP independently of the F1 subunit of ATP synthase, as suggested by the viable atp1Δ mct1Δ pho85Δ and atp2Δ mct1Δ pho85Δ cells. In theory, there are many ATPases in the mitochondrial matrix that could hydrolyze ATP and thereby allow the ADP/ATP carrier to generate MMP. However, we do not currently know exactly which ATPase(s) are activated by phosphate starvation. These data are now included as Figure 5—figure supplement 1F-G.

      4) What is the relevance of Sit4 for the Hap complex which regulates OXPHOS gene expression in yeast? The supplemental table suggests that Hap4 is strongly influenced by Sit4. Is this downstream of the proposed role in phosphate metabolism or a parallel Sit4 activity? This is a crucial point that should be addressed experimentally.

      To investigate the role of the Hap complex in MMP generation in sit4Δ cells, we separately overexpressed and knocked out HAP4, the transcriptional activator subunit of the Hap complex, in wild-type and sit4Δ cells. We confirmed HAP4 overexpression by the increased abundance of ETC complexes in BN-PAGE (Figure 2—figure supplement 1E). However, we did not observe any rescue of the ETC or ATP synthase in mct1Δ cells when HAP4 was overexpressed. The increased level of ETC complexes upon HAP4 overexpression is not sufficient to rescue the MMP (Figure 2—figure supplement 1F).

      Next, we knocked out HAP4 in sit4Δ cells. Knocking out SIT4 still increased MMP in hap4Δ cells, although to a much smaller extent, which phenocopied ETC subunit and RPO41 deletions in sit4Δ cells (Figure 2—figure supplement 1G).

      In conclusion, the Hap complex is involved in the MMP increase when SIT4 is absent. However, overexpressing HAP4 is not sufficient to increase MMP. The Hap complex is now discussed in the manuscript, and the data are presented as Figure 2—figure supplement 1E-G.

      5) The authors use the accumulation of Ilv2 precursors as proxy for mitochondrial protein import efficiency. Ilv2 was reported before as a protein which, if import into mitochondria is slow, is deviated into the nucleus in order to be degraded (Shakya,..., Hughes. 2021, Elife). Is it possible that the accumulation of the precursor is the result of a reduced degradation of pre-Ilv2 in the nucleus rather than an impaired mitochondrial import? Since a number of components of the ubiquitin-proteasome system were identified with Sit4 in the same screen, a role of Sit4 in proteasomal degradation seems possible. This should be tested.

      We thank the reviewer for pointing out this potential caveat with our Ilv2-FLAG reporter. In a limited search and testing, we could not find another reporter that behaves like Ilv2-FLAG. Ilv2-FLAG is an ideal reporter for this study because, in wild-type cells, it is not 100% imported. Therefore, we could demonstrate that mitochondria with higher MMP import it more efficiently. Unfortunately, all of the other mitochondrial proteins that we tested were efficiently imported in wild-type cells. Identifying other suitable mitochondrial proteins that behave like Ilv2-FLAG would require a more comprehensive screen.

      To address the concern of the involvement of protein degradation in obscuring the interpretation of Ilv2-FLAG import, we performed two experiments. First, we measured the proteasomal activity in wild-type and our mutants using a commercial kit (Cayman). We did not observe a statistically significant difference in 20S proteasomal activity between wild-type and sit4Δ cells.

      In the second experiment, we reduced the MMP of sit4Δ cells using CCCP treatment and measured Ilv2-FLAG import. We first treated sit4Δ cells with different doses of CCCP for six hours and measured their MMP. sit4Δ cells treated with 75 µM CCCP had MMP comparable to that of wild-type cells. When we treated sit4Δ cells with higher concentrations of CCCP, most of the cells did not survive after six hours. Next, we performed the Ilv2-FLAG import assay. In sit4Δ cells treated with 75 µM CCCP, we observed a level of unimported Ilv2-FLAG (marked with *) similar to that in wild-type cells. This result confirms that sit4Δ cells have an Ilv2-FLAG turnover mechanism and activity similar to those of wild-type cells, because when we lower the MMP in the sit4Δ background, we observe a similar level of unimported Ilv2-FLAG. We thus feel confident in concluding that the Ilv2-FLAG import results are indeed an accurate proxy for MMP level. These data are now included as Figure 1—figure supplement 1H-J in the manuscript.

      Author response image 1.

      Reviewer #2 (Public Review):

      This study reports interesting findings on the influence of a conserved phosphatase on mitochondrial biogenesis and function. In its absence, many nucleus-encoded mitochondrial proteins, including those involved in ATP generation, are expressed at much higher levels than in normal cells. In addition to providing a better understanding of the mechanisms that regulate mitochondrial function, this work may help in developing therapeutic strategies for diseases caused by mitochondrial dysfunction. However, there are a number of issues that need clarification.

      1) The rationale of the screening assay to identify genes required for the gene expression modifications observed in the mct1 mutant is not clear. Indeed, after crossing with the gene deletion library, the cells become heterozygous for the mct1 deletion and should no longer be deficient in mtFAS. Thank you for clarifying this and, if needed, adjusting Fig. S1D to indicate that the mated cells are heterozygous for the mct1 and xxx mutations.

      We updated the methods section and the graphic for the genetic screen to clarify these points within the SGA workflow overview. After we created the heterozygote by mating mct1Δ cells with the individual KO cells in the collection, these diploids underwent sporulation and selection for the desired double KO haploid. As a result, the luciferase assay was performed in haploid cells with MCT1 and one additional non-essential gene deleted.

      2) The tests shown in Fig. S1E should be repeated on individual subclones (at least 100) obtained by plating a glucose culture of the mct1 mutant for single colonies, to determine the proportion of cells with functional (rho+) mtDNA in the mct1 glucose and raffinose cultures. If, for instance, 50% of the cells were rho-, this could substantially influence the results of the analyses made with these cells (including those aiming to evaluate the MMP).

      We agree that this would provide a more confident population-level characterization of these cultures. It is important to note that we randomly chose 10 individual subclones, and all 10 were verified to be rho+. This suggests that the population has functional mtDNA, and thus we felt confident in the identity of our populations.

      3) The mitochondrial area in mct1 cells (Fig. S1G) does not seem to be consistent with the tests in Fig. 1C that indicate a diminished mitochondrial content in mct1 cells vs wild-type yeast. A better estimate (by WB, for instance) of the mitochondrial content in the analyzed strains would allow a better evaluation of the MMP changes monitored with MitoTracker, since the amount of mitochondria in cells correlates with the intensity of the fluorescence signal.

      As this reviewer pointed out, we quantified mitochondrial area based on the Tom70-GFP signal, expressed as mitochondrial area divided by cell size. Cell size is an important parameter when measuring organelle size, as most organelles scale up and down with cell size. mct1Δ cells are generally smaller than WT cells. Therefore, when scaled to cell size, the mitochondrial area of mct1Δ cells was not significantly different from that of WT cells. We believe this is the best method to compare mitochondrial area. To quantify MMP from these microscopy images, we measured the average MitoTracker Red fluorescence intensity of each mitochondrion defined by Tom70-GFP. This method inherently normalizes for mitochondrial area when quantifying MMP.
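      The two measurements described above (area scaled to cell size, and mean dye intensity per mitochondrion) can be illustrated with a minimal Python/NumPy sketch. It assumes the cell outline and the Tom70-GFP-defined mitochondria have already been segmented, the mitochondria as an integer label image in which 0 is background; the function names are hypothetical, and this is not the image-analysis pipeline actually used for the figures.

```python
import numpy as np


def mito_area_fraction(cell_mask: np.ndarray, mito_labels: np.ndarray) -> float:
    """Mitochondrial (Tom70-GFP) pixels inside the cell divided by cell pixels.

    Scaling by cell size keeps smaller cells from appearing to have less
    mitochondrial content simply because they are smaller.
    """
    cell_pixels = np.count_nonzero(cell_mask)
    mito_pixels = np.count_nonzero((mito_labels > 0) & cell_mask)
    return mito_pixels / cell_pixels


def mean_mmp_per_mitochondrion(mitotracker: np.ndarray, mito_labels: np.ndarray) -> np.ndarray:
    """Average MitoTracker intensity within each labeled mitochondrion.

    mito_labels must hold non-negative integers (0 = background). Summed
    intensity per label is divided by that label's pixel count, so the result
    does not grow with mitochondrial area.
    """
    labels = mito_labels.ravel()
    values = mitotracker.ravel().astype(float)
    sums = np.bincount(labels, weights=values)
    counts = np.bincount(labels)
    # Skip label 0 (background) and any label ids that have no pixels.
    keep = counts[1:] > 0
    return sums[1:][keep] / counts[1:][keep]


# Hypothetical 4x4 example: one cell filling the field, two mitochondria.
cell_mask = np.ones((4, 4), dtype=bool)
mito_labels = np.array([[1, 1, 0, 0],
                        [0, 0, 0, 0],
                        [0, 0, 2, 2],
                        [0, 0, 2, 2]])
mitotracker = np.array([[10, 20, 0, 0],
                        [0, 0, 0, 0],
                        [0, 0, 5, 5],
                        [0, 0, 5, 5]], dtype=float)

print(mito_area_fraction(cell_mask, mito_labels))
print(mean_mmp_per_mitochondrion(mitotracker, mito_labels))
```

      In this toy example, mitochondrion 2 covers twice as many pixels as mitochondrion 1 but its per-pixel dye signal is lower, so its mean value is lower; a summed-intensity readout would instead be dominated by area, which is the confound the per-mitochondrion average is meant to avoid.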

      4) Page 12: "These data demonstrate that loss of SIT4 results in a mitochondrial phenotype suggestive of an enhanced energetic state: higher membrane potential, hyper-tubulated morphology and more effective protein import." Furthermore, the sit4 mutant shows higher levels of OXPHOS complexes compared to WT yeast.

      Despite these beneficial effects on mitochondria, the sit4 deletion strain fails to grow on respiratory substrates. It would be good to know whether the authors have some explanation for this apparent contradiction.

      We agree that this was initially puzzling. We provide a more complete explanation above (see comments to reviewer #1 - major concern #1). Briefly, the growth deficiency of sit4Δ cells in non-fermentable media has been reported and studied by multiple groups (Arndt et al., 1989; Dimmer et al., 2002; Jablonka et al., 2006). These findings seem to indicate that sit4Δ cells contain more ETC complexes and have a higher OCR but cannot respire on non-fermentable carbon sources. However, we do not think there is yet a clear explanation for this phenotype. One interesting reported observation is that the addition of aspartate partly restores growth on ethanol (Jablonka et al., 2006). One early study speculates that this defect could be related to the cAMP-PKA pathway: Sutton et al. reported genetic interactions between SIT4 and multiple genes in the cAMP-PKA pathway (Sutton et al., 1991). In addition, sit4Δ cells share phenotypes with cAMP-PKA null mutants, such as glycogen accumulation, caffeine resistance, and failure to grow on non-fermentable media. However, to keep this manuscript succinct, we opted to stay focused on MMP.

      Reviewer #3 (Public Review):

      In this study, the authors investigate the genetic and environmental causes of elevated Mitochondrial Membrane Potential (MMP) in yeast, and also some physiological effects correlated with increased MMP.

      The study begins with a reanalysis of transcriptional data from a yeast mutant lacking the gene MCT1, whose deletion has been shown to cause defects in mitochondrial fatty acid synthesis. The authors note that in raffinose, mct1del cells, unlike WT cells, fail to induce expression of many genes that code for subunits of the Electron Transport Chain (ETC) and ATP synthase. The deletion of MCT1 also causes induction of genes involved in acetyl-CoA production after exposure to raffinose. The authors therefore conduct a screen to identify mutants that suppress the induction of one of these acetyl-CoA genes, Cit2. They then validate the hits from this screen to see which of their suppressor mutants also reduce expression of four other genes induced in an mct1del strain. This yielded 17 genes that abolished induction of all 5 genes tested in an mct1del background during growth on raffinose.

      The authors chose to focus on one of these hits, the gene coding for the phosphatase SIT4 (related to human PP6), which also caused an increase in expression of two respiratory chain genes. The authors then investigated MMP and mitochondrial morphology in strains containing SIT4 and MCT1 deletions and surprisingly saw that sit4del cells had highly elevated MMP, more reticular mitochondria, and were able to fully import the acetolactate synthase protein Ilv2p and form ETC and ATP synthase complexes, even in cells with an mct1del background, rescuing the mct1del phenotypes of low MMP, fragmented mitochondria, low import of Ilv2, and an inability to form ETC and ATP synthase complexes. Surprisingly, the authors find that even though MMP is high and ETC subunits are present in the sit4del mct1del double deletion strain, that strain has low oxygen consumption and cannot grow under respiratory conditions, indicating that the elevated MMP cannot come from fully functional ETC subunits. The authors also observe that deleting key subunits of ETC complex III (QCR2) and IV (COX5) strongly reduced the MMP of the sit4del mutant, which would suggest that the majority of the increase in MMP of the sit4del mutant was dependent on a partially functional ETC. The authors note that there was still an increase in MMP in the qcr2del sit4del and cox4del sit4del strains relative to qcr2del and cox4del strains, indicating that some part of the increase in MMP was not dependent on the ETC.

      The authors dismiss the possibility that the increase in MMP could have been caused by the reversal of ATP synthase, because they observe that inhibition of ATP synthase with oligomycin led to an increase of MMP in sit4del cells, indicating that ATP synthase is operating in the forward direction in sit4del cells.

      Noting that genes for phosphate starvation are induced in sit4del cells, the authors investigate the effects of phosphate starvation on MMP. They found that phosphate starvation caused an increase in MMP and increased Ilv2p import even in the absence of a mitochondrial genome. They find that inhibition of the ADP/ATP carrier (AAC) with bongkrekic acid (BKA) abolishes the increase of MMP in response to phosphate starvation. They speculate that phosphate starvation causes an increase in MMP through the import and conversion of ATP to ADP and subsequent pumping of ADP and inorganic phosphate out of the mitochondria.

      They further show that MMP is also increased when the cyclin dependent kinase PHO85 which plays a role in phosphate signaling is deleted and argue that this indicates that it is not a decrease in phosphate which causes the increase in MMP under phosphate starvation, but rather the perception of a decrease in phosphate as signalled through PHO85. Unlike in the case of SIT4 deletion, the increase in MMP caused by the deletion of pho85 is abolished when MCT1 is deleted.

      Finally they show an increase in MMP in immortalized human cell lines following phosphate starvation and treatment with the phosphate transporter inhibitor phosphonoformic acid (PFA). They also show an increase in MMP in primary hepatocytes and in midgut cells of flies treated with PFA.

      The link between phosphate starvation and elevated MMP is an important and novel finding, and the evidence is clear and compelling. Based on their experiments in various mammalian contexts, this link appears likely to be generalizable, and they propose and begin to test an interesting hypothesis for how elevated MMP might arise in response to phosphate starvation in the absence of the Electron Transport Chain.

      The link between phosphate starvation and deletion of the conserved phosphatase SIT4 is also interesting and important, and while the authors' experiments and analysis suggest some connection between the two observations, that connection is still unclear.

      Major points

      MitoTracker is a great fluorescent dye, but it measures membrane potential only indirectly. There is a danger that when cells change growth rates or ion concentrations, or when the pH changes, all MMP-indicating dyes change in fluorescence, so their signal is confounded. A change in phosphate levels could potentially do both: alter pH and ion concentrations. Because all conclusions of the manuscript are based on a change in MMP, it would be a great precaution to use a dye-independent measure of membrane potential to confirm at least some key results.

      MMP does strongly influence amino acid metabolism, and indeed the SIT4 knockout has a quite striking amino acid profile, with increased concentrations of histidine, lysine, arginine, and tyrosine (http://ralser.charite.de/metabogenecards/Chr_04/YDL047W.html). Could this amino acid profile support the authors' conclusions? At least lysine and arginine are down in petites due to a lack of membrane potential and iron-sulfur cluster export, and here they are up. Along these lines, according to the same data resource, the knockouts of CSR2, ASF1, SSN8, YLR0358 and MRPL25 share the same metabolic profile. Due to limited time, I did not re-analyse the data provided by the authors, but it would be worth checking whether any of these genes came up in the authors' screens.

      We tested the mutants from the deletion collection that fall within the same cluster as SIT4 in this resource and measured their MMP. ylr358cΔ cells have a similarly high MMP to that observed in sit4Δ cells. However, this gene has a yet undefined function. Beyond YLR358C, we did not observe similar MMP increases in the other gene deletions from this panel, which does not support the notion that amino acids such as histidine, lysine, arginine, or tyrosine play a determining role in driving MMP.

      The media conditions and strains used in the suggested study are very different from those in our study. Whereas that study grew prototrophic cells in minimal media without any amino acids, we used auxotrophic yeast strains grown in media containing complete amino acids. So far, none of the other defects or signaling associated with SIT4 deletion could influence MMP as much as phosphate signaling. We interpret these data as supporting the hypothesis that the MMP phenotype of sit4Δ cells is connected with phosphate signaling, as illustrated by the second half of the story in our manuscript.

      Author response image 2.

      One important claim in the manuscript attempts to explain a mechanism for the MMP increase in response to phosphate starvation which is independent of the ETC and ATP synthase.

      It seems to me the only direct evidence to support this claim is that inhibition of the AAC with BKA stops the increase of mitotracker fluorescence in response to phosphate starvation in both WT and rho0 cells (Figs 4B and 4C). It would strengthen the paper if the authors could provide some orthogonal evidence.

      This comment is similar to one raised by reviewer #1 (major concern #3). We refer the reviewer to our discussion and the new data above. Briefly, we do not think the F1 subunit is responsible for the ATP hydrolysis activity that generates MMP in the phosphate-depleted situation. We believe there are additional ATPase(s) in the mitochondrial matrix that can be coupled to the ADP/ATP carrier for MMP generation during phosphate starvation. However, we have not identified the relevant ATPase(s) at this point, and it is likely that multiple ATPases could contribute to this activity.

      Introduction/Discussion: The authors might want to make readers of the article aware that the 'reversal' of ATP synthase directionality, i.e. ATP hydrolysis by the ATP synthase as a mechanism to create a membrane potential (in petites), has always been a provocative idea, but one that thus far has never been fully substantiated. Indeed, some people who are very familiar with the topic are skeptical that this happens. For instance, Vowinckel et al 2021 (PMID: 34799698) measured precise carbon balances for petite cells and found no evidence for a futile cycle: petites grow more slowly, but accumulate the same biomass from glucose as petites that re-evolved a fast growth rate. Perhaps the manuscript could be updated accordingly.

      We thank the reviewer for pointing out this additional relevant study. We have rephrased the referenced sentence in the introduction. MMP generation during phosphate starvation is independent of the F1 portion of ATP synthase. Therefore, our data neither support nor refute either of these arguments.

      In the introduction and conclusion there is discussion of MMP set points. In particular the authors state:

      "Critically, we find that cells often prioritize this MMP setpoint over other bioenergetic priorities, even in challenging environments, suggesting an important evolutionary benefit."

      This does not seem to be consistent with the central finding of the manuscript that MMP changes under phosphate starvation. MMP doesn't seem so much to have a 'set point' but rather be an important physiological variable that reacts to stimuli such as phosphate starvation.

      The reviewer raises a rational alternative hypothesis to the one that we have proposed. In reality, both of these are complete speculations to explain the data and we can’t think of any way to test the evolutionary basis for the mechanisms that we describe. We recognize that untested/untestable speculative arguments have limitations and there are viable alternative hypotheses. We have softened our language to ensure that it is clear that this is only a speculation.

      The authors suggest that deletion of Pho85 causes an increase in MMP because of cellular signaling. However, they also state in the conclusion:

      "Unlike phosphate starvation, the pho85Δ mutant has elevated intracellular phosphate concentrations. This suggests that the phosphate effect on MMP is likely to be elicited by cellular signaling downstream of phosphate sensing rather than some direct effect of environmental depletion of phosphate on mitochondrial energetics."

      The authors should cite the study that shows deletion of PHO85 causes increased intracellular phosphate concentrations. It also seems possible that the 'cellular signaling' that causes the increase in MMP could be a result of this increase in intracellular phosphate concentrations, which could constitute a direct effect of an environmental overload of phosphate on mitochondrial energetics.

      We now cite the literature showing higher intracellular phosphate in pho85Δ cells (Gupta et al., 2019; Liu et al., 2017). Depleting phosphate in the media drastically reduced the intracellular phosphate concentration, the opposite of the situation in pho85Δ cells. Nevertheless, we observed higher MMP in both situations. We conclude from these two observations that the increase in MMP is a response to the signaling activated by phosphate depletion rather than to intracellular phosphate abundance.

      Related to this point, in the conclusion, the authors state:

      "We now show that intracellular signaling can lead to an increased MMP even beyond the wild-type level in the absence of mitochondrial genome."

      In sum, the data show that signaling is important here, but signaling alone is only the message, not the biophysical process that creates a membrane potential. The authors could revise this slightly.

      We have rephrased this sentence as suggested, which now reads “We now show that intracellular signaling triggers a process that can lead to an increased MMP even beyond the wild-type level in the absence of mitochondrial genome”.

      The authors state in the conclusion that

      "We first made the observation that deletion of the SIT4 gene, which encodes the yeast homologue of the mammalian PP6 protein phosphatase, normalized many of the defects caused by loss of mtFAS, including gene expression programs, ETC complex assembly, mitochondrial morphology, and especially MMP (Fig. 1)"

      The data shown, though, indicate that in terms of MMP, deletion of SIT4 causes a huge increase (and a departure away from normality) whether or not the mct1 deletion is present (Fig 1D).

      We changed the word “normalized” to “reversed”. In the discussion section, we also emphasized that many of these increases are independent of mitochondrial dysfunction induced by loss of mtFAS.

      The language "SIT4 is required for both the positive and negative transcriptional regulation elicited by mitochondrial dysfunction" feels strong. SIT4 seems to influence positive transcriptional regulation in response to mitochondrial dysfunction caused by MCT1 deletion (but may not be the only factor, as there appears to be an increase in CIT2 expression in a sit4del background following a further deletion of MCT1). In terms of negative regulation, SIT4 deletion clearly affects the baseline, but MCT1 deletion still causes downregulation of both examples shown in Fig 1B, showing that negative transcriptional regulation can still occur in the absence of SIT4. The authors might consider showing fold changes in expression, as they do in later figures (Figs 4B and C), to help the reader evaluate the quantitative changes they demonstrate.

      We now display the fold change as suggested. This sentence now reads “These data suggest that SIT4 positively and negatively influences transcriptional regulation elicited by mitochondrial dysfunction”.

      The authors induce phosphate starvation by adding increasing amounts of potassium phosphate monobasic at a pH of 4.1 to phosphate dropout media supplemented with potassium. The authors did well to avoid the confounding effects of removing potassium. The final pH of YNB is typically around 5.2. Is it possible that the authors are confounding a change in pH with phosphate starvation? One would expect the media in the phosphate starvation condition to have a higher pH than the phosphate replacement or control media. Is a change in pH possibly a confounding factor when interpreting phosphate starvation? Perhaps the authors could quantify the pH of the media they use for the experiment to understand how much of a factor that could be. One needs to be careful with MitoTracker and any other fluorescent dye when pH changes. Although it has its own constraints, MitoLoc, a protein-based rather than small-molecule marker of MMP, might be a good complement.

      We followed the protocol used by many other studies that depleted phosphate in the media. We and others adjusted the media without inorganic phosphate to a pH of 4.1 because that is the pH of phosphate monobasic. From there, we could add phosphate monobasic to create +Pi media without changing the media pH. Therefore, media containing different concentrations of phosphate all have exactly the same pH. To eliminate this concern, we now emphasize in the manuscript that all media containing different levels of inorganic phosphate have the same pH (see page 18).

      Even though all media have the same pH, we also provided complementary data using a parallel approach: measuring MMP by assessing mitochondrial protein import, as demonstrated previously with Ilv2-FLAG, which shares the same principle as MitoLoc.

      Reference

      Arndt, K. T., Styles, C. A., & Fink, G. R. (1989). A suppressor of a HIS4 transcriptional defect encodes a protein with homology to the catalytic subunit of protein phosphatases. Cell, 56(4), 527–537. https://doi.org/10.1016/0092-8674(89)90576-X

      Dimmer, K. S., Fritz, S., Fuchs, F., Messerschmitt, M., Weinbach, N., Neupert, W., & Westermann, B. (2002). Genetic basis of mitochondrial function and morphology in Saccharomyces cerevisiae. Molecular Biology of the Cell, 13(3), 847–853. https://doi.org/10.1091/mbc.01-12-0588

      Gupta, R., Walvekar, A. S., Liang, S., Rashida, Z., Shah, P., & Laxman, S. (2019). A tRNA modification balances carbon and nitrogen metabolism by regulating phosphate homeostasis. ELife, 8, e44795. https://doi.org/10.7554/eLife.44795

      Jablonka, W., Guzmán, S., Ramírez, J., & Montero-Lomelí, M. (2006). Deviation of carbohydrate metabolism by the SIT4 phosphatase in Saccharomyces cerevisiae. Biochimica et Biophysica Acta (BBA) - General Subjects, 1760(8), 1281–1291. https://doi.org/10.1016/j.bbagen.2006.02.014

      Liu, N.-N., Flanagan, P. R., Zeng, J., Jani, N. M., Cardenas, M. E., Moran, G. P., & Köhler, J. R. (2017). Phosphate is the third nutrient monitored by TOR in Candida albicans and provides a target for fungal-specific indirect TOR inhibition. Proceedings of the National Academy of Sciences, 114(24), 6346–6351. https://doi.org/10.1073/pnas.1617799114

      Sutton, A., Immanuel, D., & Arndt, K. T. (1991). The SIT4 protein phosphatase functions in late G1 for progression into S phase. Molecular and Cellular Biology, 11(4), 2133–2148.

    1. Author Response

      Reviewer #1 (Public Review):

      This study provides further detailed analysis of recently published Fly Cell Atlas datasets, supplemented with newly generated single cell RNA-seq data obtained from 6,000 testis cells. Using these data, the authors define 43 germline cell clusters and 22 somatic cell clusters. This work confirms and extends previous observations regarding changing gene expression programs through the course of germ cell and somatic cell differentiation.

      This study makes several interesting observations that will be of interest to the field. For example, the authors find that spermatocytes exhibit sex-chromosome-specific changes in gene expression. In addition, comparisons between the single nucleus and single cell data reveal differences in active transcription versus global mRNA levels. For example, previous work (1) showed that several mRNAs remain at high levels in spermatids long after they are actively transcribed in spermatocytes and (2) defined a set of post-meiotic transcripts. The analysis presented here shows that these patterns of mRNA expression are shared by hundreds of genes in the developing germline. Moreover, differing patterns between the sn- and sc-RNAseq datasets reveal considerable complexity in the post-transcriptional regulation of gene expression.

      Overall, this paper represents a significant contribution to the field. These findings will be of broad interest to developmental biologists and will establish an important foundation for future studies. However, several points should be addressed.

      In figure 1, I am struck by the widespread expression of vasa outside of the germ cell lineage. Do the authors have a technical or biological explanation for this observation? This point should be addressed in the paper with new experiments or further explanation in the text.

      Thank you for pointing this out. We found that our single cell dataset shows a similar (low) level of vasa expression outside the germline, suggesting that this is not due to single nucleus versus single cell RNA-seq (cluster 1, red in the left-hand UMAP).

      Analyzing the single nucleus RNA-seq in more detail revealed that, compared to the germline, both the fraction of cells in a cluster expressing vasa and the level at which they express it are very low. This analysis is included in a new Figure 1 – figure supplement 1. It is likely that much of this is due to a technical artifact, such as ambient RNA. Finally, we note in the resubmission that vasa is in fact expressed in embryonic somatic cells, and thus some of the vasa expression we observe may be real (Renault. Biol Open 2012; https://doi.org/10.1242/bio.20121909).

      Plots in the original submission drew undue attention to the few somatic cells that exhibited vasa signal, due to the fact that expressing cell points were forced to the front of the plot. Given our new analysis reporting the low levels and fraction of cells exhibiting vasa expression (Figure 1 – figure supplement 1), we have modified the panels of Figure 1, changing point size to more faithfully reflect the small proportion of somatic cells with some vasa expression.

      The proposed bifurcation of the cyst cells into head and tail populations is interesting and worth further exploration/validation. While the presented in situ hybridization for Nep4, geko, and shg hint at differences between these populations, double fluorescent in situs or the use of additional markers would help make this point clearer. Higher magnification images would also help in this regard.

      We thank the reviewer for their suggestions on clarifying the differences between HCC and TCC populations. As suggested, we have repeated the FISH experiments of Nep4 and geko with higher resolution, and included the additional marker Coracle that demarcates the junction between HCC and TCC (Figure 6O,Q,S,T). These panels replaced previous Nep4 and geko FISH images (see previous Figure 6Q,U,U’). FISH for Nep4 validated the split, and the enrichment of geko strongly suggests that this arm represents one cell type (HCCs). We have not yet identified a gene reciprocally enriched to the other arm. Therefore, in the revised submission, we call the assignment of TCC identity, and to a lesser extent, HCC identity ‘tentative’, but point out that genes predicted to be enriched to one or the other arm represent fertile candidates for the field to test.

      Reviewer #2 (Public Review):

      In this manuscript, the authors explain in greater detail a recent testis snRNAseq dataset that many of these authors published earlier this year as part of the Fly Cell Atlas (FCA) (Li et al., Science 2022). As part of the current effort, additional collaborators were recruited, and about 6,000 whole cells profiled by scRNAseq were added to the previous 42,000-nucleus dataset. The authors now describe 65 snRNAseq clusters, each representing potential cell types or cell states, including 43 germline clusters and 22 somatic clusters. The authors state that this analysis confirms and extends previous knowledge of the testis in several important areas.

      “However, in areas where testis biology is well studied, such as the development of germ cells from GSC to the onset of spermatocyte differentiation, the resolution seems less than current knowledge by considerable margins. No clusters correspond to GSCs, or specific mitotic spermatogonia, and even the major stages of meiotic prophase are not resolved. Instead, the transitions between one state and the next are broad and almost continuous, which could be an intrinsic characteristic of the testis compared to other tissues, of snRNAseq compared to scRNAseq, or of the particular experimental and software analysis choices that were used in this study.”

      Note that the referee also raises the same issue later in their review. To respond succinctly, we have placed the relevant sentence from a later portion of this referee’s comment here:

      “Support for the view that the problems are mostly technical, rather than a reflection of testis biology, comes from studies of scRNAseq in the mouse, where it has been possible to resolve a stem cell cluster, and germ cell pathways that follow known germ cell differentiation trajectories with much more discrete steps than were reported here (for example, Cao et al. 2021 cited by the authors).”

      Respectfully, we have a different interpretation of other work as cited by this referee. Our data, as well as that from others, supports the notion that transitions are generally broad and continuous and are indeed a feature of testis biology. As we report here, data from both single cell and single nucleus RNAseq exhibit transitions from one cluster to the next. Thus, this feature cannot be due to the choice of method (single cell versus single nucleus).

      In fact, prior scRNA-seq results on systems containing a continuously renewing cell population, such as is the case in the testis, do indeed exhibit a contiguous trajectory rather than discrete, well-separated cell states in gene expression space (that is, in a UMAP presentation). For example, this is the case from single-cell or single-nucleus sequencing from spermatogenesis in mouse (Cao et al 2021), human (Sohni et al 2019), and zebrafish (Qian et al 2022).

      Along differentiation trajectories in these tissues, successive clusters are defined by their aggregate transcript repertoire. Indeed, differentially expressed genes can be identified for clusters, with expression enriched in a given cluster. However, expression is rarely restricted to a cluster. For instance, Cao et al. subcluster spermatogonia into four subgroups, termed SPG1-4. They state clearly that these SPG1-4 “follow a continuous differentiation trajectory,” as can be inferred from marker expression across cells in this lineage. Similar to our findings, while the spermatogonia can fall into discrete clusters, gene expression patterns are contiguous. For example, the “undifferentiated” marker used in Cao et al, Crabp1, clearly shows expression in SPG1-3, annotated as spermatogonial stem cells, undifferentiated spermatogonia, and early differentiated spermatogonia, respectively. Likewise, markers for the “SPG3” state have detectable expression in SPG2 and SPG4, and the same holds for markers of the “SPG4” state, whose expression is also found in SPG3.

      An analogous study of human spermatogenesis arrives at a similar conclusion. In that work, although clusters are named as “spermatogonial stem cell (SSC)”, the authors are careful to specifically point out that, “…while we refer to the SSC-1 and SSC-2 cell clusters as ‘SSCs,’ scRNA-seq is not a functional assay and thus we do not know the percentage of cells in these clusters with SSC activity. These subsets almost certainly contain other A-SPG cells [A type spermatogonia], including SPG progenitors that have committed to differentiate.” (Sohni et al 2019)

      Thus, work in several disparate systems, all involving renewing lineages, finds that discrete clusters, such as a “stem cell cluster,” are not identified. In the Drosophila testis, germline differentiation proceeds in a continuous-like manner, similar to spermatogenesis in several other organisms studied by scRNA-seq. Our finding is therefore not a function of the methodology, but rather a facet of the biology of the organ.

      Operating in parallel with continuous differentiation, we found evidence of, and discussed extensively in connection with Figure 4, dramatic shifts in transcriptional state in spermatocytes compared to spermatogonia, in early spermatids compared to spermatocytes, and in late spermatid elongation. Lastly, as we describe further below, new data in this resubmission identify four distinct genes with stage-selective expression as predicted by our analysis (new Figure 2 - figure supplement 1), illustrating the utility of our study for finding new markers and new genes to test for function.

      A goal of the study was to identify new rare cell types, and the hub, a small apical somatic cell region, was mentioned as a target region, since it regulates both stem cell populations, GSCs and CySCs, is capable of regeneration, and has other fascinating properties. However, the analysis of the hub cluster revealed more problems of specificity. 41 of 120 cells in the cluster were discordant with the remaining 79, which did express markers consistent with previous studies. Why these cells co-clustered was not explained, and one can only presume that similar problems may be found in other clusters.

      Our writing seems not to have been clear enough on this point, and we thank the reviewer. We have revised the section. In addition, we have added new data (Figure 7 - figure supplement 2). We had already stated that only 79 of these 120 nuclei were near to each other in 2D UMAP space, while other members of original cluster 90 were dispersed. Thus the 79 hub nuclei in fact clustered together on the UMAP. Other nuclei that mapped at dispersed positions were initially ‘called’ as part of this cluster in the original Fly Cell Atlas (FCA) paper (Li et al., 2022), making it obvious that a correction to that assignment was necessary, which we carried out. To our eye, no other called cluster was represented by such dispersed groupings. For the hub, we definitively established the 79 nuclei to represent hub cells by marker gene analysis, including the identification of a new marker, tup, that was expressed in the 79 annotated hub nuclei but not in the 41 other nuclei (Figure 7). In this resubmission, to independently verify the relationship of the 79 nuclei to each other, we subjected the 120 nuclei from the original cluster 90 defined by the FCA study to hierarchical clustering using only genes that are highly expressed and variable in these nuclei (Figure 7 - figure supplement 2). This computationally distinct approach strongly supported our identification of the 79 definitive hub nuclei.
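      The hierarchical-clustering check on the 120 nuclei can be illustrated with a minimal sketch. Everything below is an assumption for illustration, not the parameters behind Figure 7 - figure supplement 2: the helper `cluster_nuclei` is hypothetical, and the library-size normalization, the gene-selection rule (above-median mean, then the `n_top_genes` most variable genes), average linkage and correlation distance are our choices.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def cluster_nuclei(counts, n_top_genes=200, n_clusters=2):
    """Hierarchically cluster nuclei (rows of a nuclei x genes count matrix)
    using only highly expressed, highly variable genes (hypothetical rule)."""
    counts = np.asarray(counts, dtype=float)
    # Normalize each nucleus to 10,000 total counts, then log-transform.
    lib_size = counts.sum(axis=1, keepdims=True)
    norm = np.log1p(counts / np.maximum(lib_size, 1.0) * 1e4)
    # Assumed selection: among genes with above-median mean expression,
    # keep the n_top_genes with the highest variance across nuclei.
    mean = norm.mean(axis=0)
    var = np.where(mean > np.median(mean), norm.var(axis=0), -np.inf)
    top = np.argsort(var)[::-1][:n_top_genes]
    # Average-linkage clustering on correlation distance between nuclei.
    tree = linkage(norm[:, top], method="average", metric="correlation")
    # Cut the tree into at most n_clusters flat clusters (labels start at 1).
    return fcluster(tree, t=n_clusters, criterion="maxclust")
```

      With simulated data in which 79 nuclei share one expression program and 41 share another, the two groups fall into separate branches of the tree, independently of any UMAP embedding.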

      Indeed, many other indications of specificity issues were described, including contamination of fat body with spermatocytes, the expression of germline genes such as Vasa in many somatic cell clusters like muscle, hemocytes, and male gonad epithelium, and the promiscuous expression of many genes, including 25% of somatic-specific transcription factors, in mid to late spermatocytes. The expression of only one such gene, Hml, was documented in tissue, and the authors, for reasons not explained, did not attempt to decisively address whether this phenomenon is biologically meaningful.

      We discussed the question of vasa expression in somatic clusters in some detail above, in response to referee #1, and included new analysis in the resubmission.

      With respect to the observation of ‘somatic gene’ expression in spermatocytes, we are also intrigued. We do not believe this is due to “contamination,” but rather reflects a spermatocyte expression program that includes expression of somatic genes. First, these somatic markers were not observed in other germline clusters, as would be expected if this were due to general transcript contamination. Second, we observed expression of somatic markers in spermatocytes independently in the single-cell and single-nucleus data, making it unlikely to be an artifact of preparation of isolated nuclei. Finally, in the resubmission, in addition to Hml, we validated ‘somatic’ marker expression in spermatocytes by FISH for Vsx1, a marker of somatic tail cyst cells. Vsx1 is predicted to be expressed at low levels in spermatocytes in our dataset and is clearly visible in germline cells by FISH (Figure 3 – figure supplement 2G,H). We also refer the referee to Figure 6K, where the mRNA for the somatic cyst cell marker eya was observed by FISH at low levels in spermatocytes.

      A truly interesting question mentioned by the authors is why the testis consistently ranks near the top of all tissues in the complexity of its gene expression. In the Li et al. (2022) paper it was suggested that this is due to an inherently greater biological complexity of spermiogenesis than of other tissues. It seems difficult to independently and rationally determine "biological complexity," but if a conserved characteristic of testis was to promiscuously express a wide range of (random?) genes, something not out of the question, this would be highly relevant and important.

      We agree that the massive transcriptional program found in spermatocytes is, indeed, truly interesting. There are many speculations as to why spermatocytes are so transcriptionally active, including the possibility of “transcriptional scanning” (e.g., Xia et al. 2020) regulating the evolution of new genes. Testing such models is beyond the scope of this paper. However, one must also keep in mind that spermatogenesis involves one of the most dramatic cellular transformations in biology, in which changes spanning the nucleus, chromatin, the Golgi, the cell cycle, extensive membrane addition, cell shape, and the building of a complex swimming organelle must all occur and be temporally coordinated. Small wonder that many genes must be expressed to accomplish these tasks.

      Unfortunately, the most likely problems are simply technical. Drosophila cells are small and difficult to separate as intact cells. The use of nuclei was meant to overcome this inherent problem, but the effectiveness of this new approach is not yet well-documented. Support for the view that the problems are mostly technical, rather than a reflection of testis biology, comes from studies of scRNAseq in the mouse, where it has been possible to resolve a stem cell cluster, and germ cell pathways that follow known germ cell differentiation trajectories with much more discrete steps than were reported here (for example, Cao et al. 2021 cited by the authors).

      We respectfully disagree with the referee about this collection of statements. First, the use of snRNA-seq has been extensively characterized and compared to scRNA-seq in brain tissue by McLaughlin et al., 2021 (cited in the original submission) and was shown to be effective (McLaughlin, et al. eLife 2021;10:e63856. DOI: https://doi.org/10.7554/eLife.63856). snRNA-seq has a distinct advantage when dealing with long, thin cells, such as neurons or cyst cells (as featured in this work), where cytoplasm can easily be sheared off during cell isolation. Second, in a previous portion of our response to this referee, we discussed how our interpretation of Cao et al., 2021 differs from that expressed by this referee. Lastly, as requested in ‘Essential revision’ 2, we adjusted clustering methods and selected four genes, two predicted to be markers for early stage germline cells, and two for mid-spermatocyte stage development. FISH analysis demonstrates that expression of each of these maps to the appropriate stages (new Figure 2 - figure supplement 1). This confirms that the datasets we present in this manuscript can be mined to identify unique, diagnostic markers for various stages.

      The conclusions that were made by the authors seem to be either facts that are already well known, such as the problem that transcriptional changes in spermatocytes will be obscured by the large stored mRNA pool, or promises of future utility. For example, "mining the snRNA-seq data for changes in gene expression as one cluster advances to the next should identify new sub-stage-specific markers." If worthwhile new markers could be identified from these data, surely this could have been accomplished and presented in a supplemental Table. As it currently stands, the manuscript presents the dataset including a fair description of its current limitations, but very little else of novel biological interest is to be found.

      “In sum, this project represents an extremely worthwhile undertaking that will eventually pay off. However, some currently unappreciated technical issues, in cell/nuclear isolation, and certainly in the bioinformatic programs and procedures used that mis-clustered many different cells, have created the current difficulties.

      Most scRNAseq software is written to meet the needs of mammalian researchers working with cultured cells, which are cellular giants compared to Drosophila cells and of generally similar size. Such software may not be ideal for much smaller cells, nor for the much wider variation in cell size, properties and biological mechanisms that exists in the world of tissues.”

      We appreciate the referee’s acknowledgement that this ‘undertaking will eventually pay off’. It was not our intention to address ‘function’ in this study, but rather to make the system accessible to the broadest community possible. We are uncertain whether any reservation remains for this referee. A brief summary of what we covered in the manuscript may help allay any residual concern. Study of the Drosophila testis and spermatogenesis benefits from a large number of established cell-type and stage-selective markers. Thus, we extensively used the community’s accepted markers to assign identity to clusters in both the sn- and sc-RNA-seq UMAPs. We believe that effort establishes the validity and reliability of the dataset. Furthermore, we identified upwards of a dozen new markers from the cluster analysis, and verified their expression by FISH or reporter line in various figures throughout (tup, amph, piwi, geko, Nep4, CG3902, Akr1B, loqs, Vsx1, Drep2, Pxt, CG43317, Vha16-5, l(2)41Ab). To our mind, these contributions, coupled with annotation of the datasets, suggest strongly that they will serve the community well. This is especially true as we provide users with objects that they can feed into commonly used software such as Seurat and Monocle to explore the datasets for their own purposes. Rather than simply relying on default settings within some of the applications, we also adjusted parameters for various clusterings as called for, some of them in response to astute comments from referees, and included these in the resubmission. Of course, it is possible that rare issues may arise in the datasets as these are further studied, but that is the case with all scRNA-seq data, and is not specific to work on this model organism.

      Reviewer #3 (Public Review):

      In this study, the authors use recently published single nucleus RNA sequencing data and a newly generated single cell RNA sequencing dataset to determine the transcriptional profiles of the different cell types in the Drosophila testis. Their analysis of the data and experimental validation of key findings provide new insight into testis biology and create a resource for the community. The manuscript is clearly written, the data provide strong support for the conclusions, and the analysis is rigorous. Indeed, this manuscript serves as a case study demonstrating best practices in the analysis of this type of genomics data and the many types of predictions that can be made from a deep dive into the data. Researchers who are studying the testis will find many starting points for new projects suggested by this work, and the insightful comparison of methods, such as between slingshot and Monocle3 and single cell vs single nucleus sequencing, will be of interest beyond the study of the Drosophila testis.

      We greatly appreciate the reviewer’s comments.

      Reviewer #4 (Public Review):

      This is an extraordinary study that will serve as a key resource for all researchers in the field of Drosophila testis development. The lineages that derive from the germline stem cells and somatic stem cells are described in a detail that has not been previously achieved. The RNAseq approaches have permitted the description of cell states that have not been inferred from morphological analyses, although it is the combination of RNAseq and morphological studies that makes this study exceptional. The field will now have a good understanding of interactions between specific cell states in the somatic lineage and specific states in the germ cell lineage. This resource will permit future studies on precise mechanisms of communication between these lineages during the differentiation process, and will serve as a model for studies of co-differentiation in other stem cell systems. The combination of snRNAseq and scRNAseq has conclusively shown differences in transcriptional activation and RNA storage at specific stages of germ cell differentiation and is a unique study that will inform other studies of cell differentiation.

      Could the authors please describe whether genes on the Y chromosome are expressed outside of the male germline. For example, what is represented by the spots of expression within the seminal vesicle observed in Figure 3D?

      Prior work demonstrated that proteins encoded by Y-linked genes are not expressed outside of the germline (Zhang et al. Genetics 2020. https://doi.org/10.1534/genetics.120.303324). In our snRNAseq dataset, we find that genes on the Y chromosome are not highly expressed outside of the male germline (roughly 100-fold lower in other tissues). In fact, we observe Y chromosome transcripts at this level in many nuclei across tissues collected for the Fly Cell Atlas project, including the ovary. Since we have not followed up on the Fly Cell Atlas observations directly by using FISH to examine Y chromosome transcript expression outside the germline, we cannot rule out the possibility that such low-level expression is real. However, detection across several tissues argues that this is likely a technical artifact. With regard to ‘spots of expression within the seminal vesicle’ (Figure 3D), a spot is colored red if the average expression level of genes on the Y chromosome is greater in that cell than in an average cell on our plot. These red spots are likely due to carried-over ambient RNA.
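      The coloring rule for Figure 3D, as described above, can be sketched in a few lines. This is an illustrative reading, assuming a cells x genes expression matrix `expr` and a boolean mask `is_y_gene` marking Y-linked genes (both hypothetical names), and taking "an average cell" to mean the mean Y score across all plotted cells.

```python
import numpy as np


def flag_y_high_cells(expr, is_y_gene):
    """Boolean mask of cells whose mean Y-linked expression exceeds that of the average cell."""
    expr = np.asarray(expr, dtype=float)
    # Per-cell score: average expression over the Y-linked genes only.
    y_score = expr[:, np.asarray(is_y_gene, dtype=bool)].mean(axis=1)
    # A cell is flagged (colored red) if its score is above the population mean.
    return y_score > y_score.mean()
```

      Because the threshold is relative to the plotted population, a handful of somatic cells carrying low levels of ambient Y-linked RNA can be flagged even when absolute expression is far below germline levels.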

      I would appreciate some discussion of the "somatic factors" that are observed to be upregulated in spermatocytes (e.g. Mhc, Hml, grh, Syt1). Is there any indication of functional significance of any of these factors in spermatocytes?

      This is an excellent question. Although we validated expression for several (Hml, Vsx1 and eya), we did not test for their function here and this issue remains to be studied. This is now directly stated in the main text.

      In the discussion of cyst cell lineage differentiation following cluster 74 the authors state that neither the HCC or TCC lineages were enriched for eya (Figure 6V). It seems in this panel that cluster 57 shows some enrichment for eya - is this regarded as too low expression to be considered enriched?

      We thank the reviewer for their insightful comment and we agree with their conclusions. We have modified the text to reflect the low, but present, expression of eya in the HCC and TCC lineages. The text now reads as follows: “Enrichment of eya was dramatically reduced in the clusters along either late cyst cell branch compared to those of earlier lineage nuclei (Figure 6J,U).”

    1. Author Response

      Reviewer #1 (Public Review):

      This paper presents analysis of an impressive dataset acquired from sibling pairs, where one child had a specific gene mutation (22q11.2DS), whereas the other child served as a blood-related, healthy control. The authors gathered rich, multi-faceted data, including genetic profile, behavioral testing, neuropsychiatric questionnaires, and sleep PSG.

      The analyses explore group differences (gene mutation vs. healthy controls) in terms of sleep architecture, sleep-specific brain oscillations and performance on a memory task.

      The authors utilized a solid mixed-model statistical approach, which not only controlled for the multiple-comparison problem, but also accounted for between-subject and within-family variance. This was supplemented by mediation analysis, exploring the exact interaction between the variables. Remarkably, the two subject groups were gender balanced, and were matched in terms of age and sex.

      Thank you for this endorsement of our approach.

      There are some aspects requiring clarification. In the discussion section, some claims come across as too general, or too speculative, and lack proper evidence in the current analysis or in the references.

      We have extensively revised our discussion, including introducing more references and adding subheadings, which we hope make our conclusions both more structured and better evidenced (Discussion, pages 27 – 31).

      Furthermore, the authors seem to treat their (child) participants with the gene mutation as forerunners of (adult) schizophrenic patients, to whom they repeatedly compare the findings. However, less than half of these children with 22q11.2DS are expected to develop psychotic disorders. In fact, they are at risk of many other neuropsychiatric conditions (incl. intellectual disability, ASD, ADHD, epilepsy) (cf. introduction section).

      We have revised our introduction (pages 4 – 5) and discussion to clarify the significant comorbidity in 22q11.2DS. We discuss the limitations of our work and future directions in the Discussion (page 30).

      Furthermore, the liberal criteria for detecting slow-waves, along with odd topography of the detections, limit the credibility of the slow-wave-related results.

      As there is no single common method for SW detection, as noted on page 37, we prioritised rate of detection in order to provide a robust dataset for spindle-SW coupling analysis. We considered the use of an absolute detection threshold (e.g. −75 µV); however, because our participants spanned a wide range of ages (6 to 20 years), and it is established that the absolute amplitude of the EEG decreases across childhood (e.g. Hahn et al 2020), our view is that an absolute detection threshold would potentially bias the detection of slow waves by age. We have added comments on this matter to the methods section (page 37).

      Lastly, we cannot be sure whether the presented memory effects reflect between-group difference in general cognitive capacities, or, as claimed, in overnight memory consolidation.

      We have added statistical analysis of the overnight change in performance (results, page 6) to explore this point. We clarify that although 22q11.2DS is associated with slower learning and worse accuracy in the test session, the overnight change in performance does not differ between 22q11.2DS and controls.

      Generally, the current study introduces dataset connecting various aspects of 22q11.2DS. It has a great potential for complementing the current state of knowledge not only in the clinical, but also in sleep-science field.

      Thank you.

      Reviewer #2 (Public Review):

      This study examines 22q11.2 microdeletion syndrome in 28 individuals and their unaffected siblings. Though the sample size is small, it is on par with many neuroimaging studies of the syndrome. Part of the interest in this disorder arises from the risk this syndrome confers for neuropsychiatric disorders in general and psychosis specifically. The authors examine sleep neurophysiology in 22q11.2DS and their siblings. Principal findings include increased slow wave and spindle amplitudes in deletion carriers as compared to controls.

      Strengths of this manuscript include:

      • The inclusion of siblings as a control group, which minimizes environmental and (other) genetic confounds

      • The data analyses of the sleep EEG are appropriate and in-depth

      • High-density sleep EEG allows for topographic mapping

      We thank the reviewer for this positive endorsement of our work.

      Weaknesses of this manuscript include:

      • The manuscript is framed as an investigation of psychosis and schizophrenia; however, psychotic experiences did not differ between 22q11.2DS and healthy controls. Therefore, the emphasis on schizophrenia and psychosis does not pertain to this sample, and the manuscript introduction and discussion should be carefully reframed. The final sentence of the abstract is also not supported by the data: "... our findings may therefore reflect delayed or compromised neurodevelopmental processes which precede, and may be biomarkers for, psychotic disorders".

      We have expanded our abstract, introduction and discussion to reflect the complex neurodevelopmental phenotype observed in 22q11.2DS, and discuss the links between our findings and elements of this phenotype.

      • What is the rationale for using a mediation model to test for the association between genotype and psychiatric symptoms? Given the modest sample size would a regression to test the association between genotype and psychiatric symptoms be more appropriate?

      Our rationale for mediation analysis was to go beyond simple group comparisons of individual measures by asking whether genotype effects on particular psychiatric/behavioural measures were potentially mediated by EEG measures. This is of considerable interest because, as noted above, the behavioural and psychiatric phenotype in 22q11.2DS is complex; dissecting links between particular EEG features and phenotypes, and asking whether EEG measures can be biomarkers for these phenotypes, may therefore give insight into this complexity.
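      To make the logic of the mediation question concrete, the sketch below shows the classic product-of-coefficients decomposition for a single mediator using ordinary least squares. It is a simplification and not the models used in the manuscript: it ignores the age covariate and the within-family (sibling) structure handled by the mixed models, and it omits inference (e.g. bootstrapped confidence intervals). The helper names are hypothetical.

```python
import numpy as np


def ols_coefficients(y, *predictors):
    """Ordinary least squares with an intercept; returns [intercept, b1, b2, ...]."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def mediation_effects(x, m, y):
    """Single-mediator decomposition: total effect c = direct c' + indirect a*b."""
    a = ols_coefficients(m, x)[1]               # path a: genotype -> EEG measure
    _, c_direct, b = ols_coefficients(y, x, m)  # direct effect c' and path b
    c_total = ols_coefficients(y, x)[1]         # total effect c of genotype on outcome
    return {"indirect": a * b, "direct": c_direct, "total": c_total}
```

      For ordinary least squares on the same observations, the total effect equals the direct effect plus the indirect effect exactly; a plain regression of symptoms on genotype estimates only the total effect and cannot say how much of it runs through the EEG measure.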

      • From Table 1, which presents means, standard deviations and statistics, it is hard to tell whether there is a range of symptoms or whether some participants with 22q11.2DS met diagnostic criteria for the listed disorder while others have no or sub-threshold symptoms. This is important and informs the statistical analysis. Given the broad range of psychiatric symptoms, I also wonder if a composite score of psychopathology may be more appropriate. What about other psychiatric symptoms such as depression?

      We have added a supplementary figure to figure 1 providing individual participants' scores on psychiatric measures and FSIQ, to fully inform the reader about individual data.

      We have used symptom scores, rather than binary diagnostic cut-offs, to make maximal use of our dataset, because many (arguably all) psychiatric phenotypes exist on a spectrum, and to reflect the difference between clinical and research diagnoses.

      Regarding depression, it has been previously demonstrated in 22q11.2DS that mood disorders are rare at young ages (Chawner et al 2019); given this low frequency, we have not included depression in this dataset.

      We have considered the utility of a composite psychopathology score; however, it is already established that 22q11.2DS can be associated with a broad range of psychiatric/behavioural difficulties. In this study we were primarily interested in exploring the links (if any) between specific groups of symptoms and specific features of the sleep phenotype. Therefore, we feel a composite psychopathology score would not add to the overall clarity of the manuscript.

      • The age range is very broad, spanning 6 to 20 years. As there are marked changes in the sleep EEG with age, it is important to understand the influence of age. The small sample size precludes investigating age by group interactions meaningfully, but presenting the individual ages of 22q11.2DS participants and controls, rather than means, standard deviations and ranges, would help the reader understand the sample.

      We have added scatter plots of EEG measures against age to each figure supplement to allow the reader to see changes with age.

      Also, a figure showing individual data (e.g., spindle power) as a function of age and group would be informative. The authors should also discuss the possibility that the difference between the groups may vary as a function of age as has been shown for cortical grey matter volume (Bagaiutdinova et al., Molecular Psychiatry, 2021).

      We have provided plots of individual data with age for our main figures, in the figure supplements. We also note we have included age as a covariate in all main statistical models (methods, page 39). We thank the reviewer for the additional reference, this has been added to the discussion (page 29)

      • There is a large group difference with regard to full scale IQ. IQ is associated with sleep spindles (e.g., Gruber et al., Int J Psychophysiol, 2013; Geiger et al., SLEEP, 2011). For this reason, the authors should control for IQ in all analyses.

      We note that the relationship between spindle characteristics and IQ has been questioned (e.g. Reynolds et al 2018 performed a meta-analysis which suggests no correlation with FSIQ, which argues against the suggested approach). We also note that genotype effects on FSIQ were not mediated by spindle properties. Furthermore, the phenotype in 22q11.2DS is complex; while lower IQ is a well-evidenced part of it, it is only one component. We are unclear whether it would be justified to regress out only one component of a phenotype.

      • The authors find greater power in the delta and sigma bands in 22q11.2DS compared to their siblings. Looking at Figure 2, it appears power is elevated across frequencies. If this were the case, this would likely change the interpretation of the findings, and suggest that the sleep EEG likely reflects changes in cortical thickness between controls and 22q11.2DS participants.

      We thank the reviewer for this interesting comment. We have now altered our approach to the analysis of spectral data in order to probe differences in overall power, using the IRASA approach described by Hahn et al 2020. We present these results on page 13, use measures derived from this analysis in the mediation and behavioural analyses, and discuss these findings in the discussion (page 29).

      • Along the same lines as the above comment, it would be interesting to examine REM sleep and test how specific to sleep spindles and slow waves these findings are.

      We have now added analysis of REM-derived spectral measures, which we believe complements our finding of altered proportions of REM sleep in 22q11.2DS compared to controls (page 13).

      Reviewer #3 (Public Review):

      In this study, Donnelly and colleagues quantified sleep oscillations and their coordination in young people with 22q11.2 Deletion Syndrome and their siblings. They demonstrate that 22q11.2DS was associated with enhanced power in the slow-wave and sleep spindle range, elevated slow-wave and spindle amplitudes and altered coupling between spindles and slow-waves. In addition, spindle and slow-wave amplitudes in 22q11.2DS correlated negatively with the outcomes of a memory test. Overall, the topic and the results of the present study are interesting and timely. The authors employed many thoughtful analyses, making sense out of complicated data. However, some features of the manuscript need further clarification.

      1.) Several interesting results of the manuscript are related to altered sleep spindle characteristics in 22q11.2DS (increased power, increased amplitudes and altered coupling with slow waves). On top of that, the authors report that the spindle frequency was correlated with age. I was wondering whether the authors might want to take these individual (age-related) differences into account in their analyses. The authors could detect the peak spindle frequency per participant and inform their spindle detection procedure accordingly. This procedure might lead to an even clearer picture of altered spindle activity in 22q11.2DS.

      We thank the reviewer for this informative suggestion. We have now implemented this method, detecting spindles for each individual at a frequency defined through IRASA analysis of the EEG (results, page 13; methods, page 35), and then using the properties of spindles detected through this method in further analysis.

      We have included age as a covariate in all main models (methods, page 39), and present individual data scattered with age in our figure supplements.
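      The idea of an individually defined spindle frequency can be illustrated with a simplified sketch. Note that this does not implement IRASA, which separates fractal and oscillatory components by irregular resampling; as a plainly swapped-in stand-in, it fits a single power law to the spectrum in log-log space and takes the largest residual peak within an assumed sigma band. The fit range (2–30 Hz), the band limits (9–16 Hz) and the function name are hypothetical.

```python
import numpy as np


def spindle_peak_frequency(freqs, psd, fit_range=(2.0, 30.0), sigma_band=(9.0, 16.0)):
    """Estimate an individual spindle frequency as the sigma-band peak of the
    PSD after removing a power-law (1/f) fit. Simplified stand-in, not IRASA."""
    freqs = np.asarray(freqs, dtype=float)
    psd = np.asarray(psd, dtype=float)
    # Fit log10(power) = intercept + slope * log10(frequency) over fit_range.
    fit = (freqs >= fit_range[0]) & (freqs <= fit_range[1])
    slope, intercept = np.polyfit(np.log10(freqs[fit]), np.log10(psd[fit]), 1)
    # Residual (oscillatory) power above the aperiodic fit, within the sigma band.
    band = (freqs >= sigma_band[0]) & (freqs <= sigma_band[1])
    aperiodic = intercept + slope * np.log10(freqs[band])
    residual = np.log10(psd[band]) - aperiodic
    return float(freqs[band][np.argmax(residual)])
```

      The per-participant value returned here would then centre the band used for spindle detection, so that a younger participant with slower spindles is not penalised by a fixed, group-wide band.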

      2.) The authors state in the methods section that EEG data was re-referenced to a common average during pre-processing. Did the authors take into account that this reference scheme will lead to a polarity inversion of the signal, potentially over parietal/occipital areas? This inversion will not affect spindle related analyses, but might misguide the detection of slow waves and hence confound related analyses and results.

      We have reviewed our data preprocessing pipeline, and updated it based on the latest methods suggested by the EEGLAB authors (methods, page 33). As a supplementary analysis we applied a heuristic signal polarity measure described by the authors of the luna software package https://zzz.bwh.harvard.edu/luna/vignettes/nsrr-polarity/ and did not observe any inversion of polarity in our sample.

      In the included figure (below) we calculated the Hjorth measure of signal polarity as described in luna, at every electrode and plotted a topoplot of the measure. In the figure numbers > 0 represent signals with a positive polarity, values < 0 a negative polarity. As demonstrated in the figure, there were no electrodes with a positive polarity, although we note that the most peripheral electrodes had an approximately neutral polarity, whereas more central electrodes had a slight negative bias.

      We also note that we only detected negative half-waves with our slow-wave detection algorithm, using a threshold set for each channel based on its own characteristics, so we would not necessarily expect alterations in slow-wave detection. Further, other authors have suggested that average referencing does not affect SW detection (e.g. Wennberg 2010).
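      Detection of negative half-waves with a channel-specific threshold, as described above, can be sketched as follows. The exact criteria are given in the manuscript's methods and are not reproduced here; the half-wave duration limits (0.25–1.0 s, i.e. roughly 0.5–2 Hz full waves) and the "top quartile of trough amplitudes on this channel" rule are assumptions for illustration, and the function name is hypothetical.

```python
import numpy as np


def detect_negative_half_waves(x, fs, min_dur=0.25, max_dur=1.0, percentile=75):
    """Detect negative half-waves (between a downward and the next upward zero
    crossing) and keep those whose trough depth reaches a per-channel threshold."""
    x = np.asarray(x, dtype=float)
    negative = np.signbit(x)
    # Index of the first sample after a >=0 -> <0 crossing, and after a <0 -> >=0 crossing.
    down = np.flatnonzero(~negative[:-1] & negative[1:]) + 1
    up = np.flatnonzero(negative[:-1] & ~negative[1:]) + 1
    candidates = []
    for start in down:
        later = up[up > start]
        if later.size == 0:
            break  # the final half-wave is not closed by an upward crossing
        end = later[0]
        if min_dur <= (end - start) / fs <= max_dur:
            trough = start + int(np.argmin(x[start:end]))
            candidates.append((start, end, trough, -x[trough]))
    if not candidates:
        return []
    # Relative threshold derived from this channel's own trough-depth distribution,
    # so no absolute amplitude (e.g. in microvolts) is imposed across ages.
    threshold = np.percentile([c[3] for c in candidates], percentile)
    return [c[:3] for c in candidates if c[3] >= threshold]
```

      Because the threshold is a percentile of each channel's own trough depths, the same rule adapts to the lower absolute EEG amplitudes seen in older participants.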

      3.) I have some issues understanding the reported slow wave - spindle coupling results. Figure 5A indicates that ~100 degrees correspond to the down-state of the slow wave. Figure 5E shows that spindles preferentially clustered at fronto-central electrodes between 0 and 90 degrees, hence they seem to peak towards the slow wave downstate. This finding is rather puzzling given the prototypical grouping of sleep spindles by slow wave up-states (Staresina, 2015; Helfrich, 2018; Hahn, 2020). Could it be that the majority of detected spindles represent slow spindles (9-12 Hz; Mölle, 2011)?

      We observed peaks of spindle activity in the range of 9 – 24 degrees (that is, on the descending slope from the positive peak of the slow wave), but average spindle frequencies in the 12 – 13 Hz range. Given that we allowed each individual to have an individual spindle detection frequency, as above, and did not observe bimodal distributions of power in the sigma frequency band (Figure 2 Supplement 1), we do not believe our spindles specifically represent slow spindles.

      Slow spindles are known to peak rather at the up- to down-state transition (which would fit the reported results) and show a frontal distribution (which again would fit to the spindle amplitude topographies in Fig 3E). If that was the case, it would make sense to specifically look at fast spindles (12-16 Hz) as well, given their presumed role in memory consolidation (Klinzing, 2019).

      We agree with the reviewer’s assessment of the distribution of the putative spindles we have detected. However, as we and other authors (Hahn et al 2020) have noted, we did not observe discrete fast and slow spindle frequency peaks in our analysis of the PSD (as has been observed by other authors e.g. Cox et al 2017). For this reason, and to reduce the complexity of the manuscript, we believe the best approach with our dataset is to focus on spindles at large, rather than detecting spindles in arbitrary frequency bands.

      In addition, is it possible that the rather strong phase shift from fronto-central to occipital sites is driven by a polarity inversion due to using a common reference (see comment 2)?

      As noted above, we do not observe significant polarity inversion in our signals using the Luna heuristic measure. We were not able to identify published literature to inform our investigation of this suggestion, but we would be happy to consider any specific suggestions from the reviewer.

      Apart from that, I would suggest statistically evaluating non-uniformity using, e.g., the Rayleigh test (both within and across participants).

      We have added an analysis of non-uniformity to the results section (results, page 20).
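      For readers less familiar with the Rayleigh test, a minimal Python sketch of the within- and across-participant versions follows. It assumes spindle phases in degrees and uses Zar's closed-form approximation to the p-value (the same formula used, e.g., in CircStat's circ_rtest); the function names are hypothetical and this is not the analysis code used in the manuscript.

```python
import math


def rayleigh_test(angles_deg):
    """Rayleigh test for non-uniformity of circular data (angles in degrees).

    Returns (mean resultant length r, Rayleigh Z, approximate p-value).
    """
    n = len(angles_deg)
    if n == 0:
        raise ValueError("need at least one angle")
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    big_r = math.hypot(c, s)      # length of the summed unit vectors
    r = big_r / n                 # 0 = no preferred phase, 1 = identical phases
    z = big_r ** 2 / n            # Rayleigh Z statistic (= n * r**2)
    # Zar's approximation to the p-value, clipped to at most 1
    p = math.exp(math.sqrt(1 + 4 * n + 4 * (n ** 2 - big_r ** 2)) - (1 + 2 * n))
    return r, z, min(p, 1.0)


def circular_mean_deg(angles_deg):
    """Preferred (mean) phase in degrees, in [0, 360)."""
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c)) % 360


def across_participants(phases_by_participant):
    """Across-participant test: Rayleigh test on each participant's preferred phase.

    (Within a participant, call rayleigh_test on that participant's phases directly.)
    """
    return rayleigh_test([circular_mean_deg(p) for p in phases_by_participant])
```

      Testing the per-participant preferred phases, rather than pooling all spindles, keeps participants with many spindles from dominating the group-level result.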

      4.) Somewhat related to the point raised above. The authors state in the methods that slow wave-spindle events were defined as time windows where the peaks of spindles overlapped with slow waves. How was the duration of slow waves defined in this scenario? If it was up- to up-state, the authors might miss spindles which lock briefly after the post down-state up-state, thereby overrepresenting spindles that lock to early phases of slow waves. Why not just define a clear slow wave-related time window, such as slow wave down-state ± 1.5 seconds?

      We have implemented this suggestion (methods, page 38)
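      A window of ± 1.5 s around each down-state can be sketched as follows. This is a minimal illustration assuming event times in seconds; the function name is hypothetical and it is not the authors' implementation. With fixed windows, a spindle can be assigned to two slow waves whose down-states are less than 3 s apart.

```python
from bisect import bisect_left, bisect_right


def spindles_in_sw_windows(downstate_times, spindle_peak_times, half_window=1.5):
    """For each slow-wave down-state time (s), list the spindle peak times
    falling within [down-state - half_window, down-state + half_window]."""
    peaks = sorted(spindle_peak_times)
    coupled = []
    for t in downstate_times:
        lo = bisect_left(peaks, t - half_window)   # first peak >= window start
        hi = bisect_right(peaks, t + half_window)  # one past the last peak <= window end
        coupled.append(peaks[lo:hi])
    return coupled
```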

      5.) The authors correlated the NREM sleep features with the outcomes of a post-sleep memory test (both encoding and an initial memory test took place pre-sleep). If the authors want to show a clear association between sleep-related oscillations and the behavioural expressions of memory consolidation, taking just the post sleep memory task is probably not the best choice. The post-sleep test will, as the pre-sleep test, in isolation rather reflect general memory related abilities. To uncover the distinct behavioural effects of consolidation the authors should assess the relative difference between the pre- and post-sleep memory performance and correlate this metric with their EEG outcomes.

      We have added the evening-morning performance difference as a measure to the results (page 6); however, as there was no difference between groups in overnight change in performance, we focus on morning performance when relating behaviour to EEG outcomes (explored in results, page 6).
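      The overnight-change measure the reviewer suggests can be sketched as below: a per-participant morning-minus-evening difference (optionally relative to evening performance), correlated with an EEG feature using Pearson's r. The function names and the choice of Pearson correlation are illustrative assumptions, not the manuscript's analysis.

```python
import math


def overnight_change(evening, morning, relative=False):
    """Per-participant change in memory performance from evening to morning."""
    if relative:
        return [(m - e) / e for e, m in zip(evening, morning)]
    return [m - e for e, m in zip(evening, morning)]


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

      For example, pearson_r(overnight_change(evening_scores, morning_scores), spindle_density) would relate consolidation to a sleep feature, where all three inputs are hypothetical per-participant lists.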

    1. Author Response:

      Reviewer #1 (Public Review):

      Cell surface proteins are of vital interest in the functions and interactions of cells and their neighbors. In addition, cells manufacture and secrete small membrane vesicles that appear to represent a subset of the cell surface protein composition.

      Various techniques have been developed to allow the molecular definition of many cell surface proteins, but most rely on the special chemistry of amino acid residues on the parts of membrane proteins exposed to the cell exterior.

      In this report, Kirkemo et al. have devised a method that more comprehensively samples the cell surface protein composition by relying on the membrane insertion or protein glycan adhesion of an enzyme that attaches a biotin group to a nearest neighbor cellular protein. The result is a more complex set of proteins and distinctive differences between normal cells and myc oncogene-driven tumor cells and their secreted extracellular vesicle counterparts. These results may be applied to the identification of unique cell surface determinants in tumor cells that could be targets for immune or drug therapy. The results may be strengthened by a more thorough evaluation of the different EV membrane species represented in the broad collection of EVs used in this investigation.

      We thank the reviewer for recognizing the importance of the work outlined in the manuscript. We have addressed the necessary improvements in the essential revisions section above.

      Reviewer #2 (Public Review):

      This paper describes two methods for labeling cell-surface proteins. Both methods involve tethering an enzyme to the membrane surface to probe the proteins present on cells and exosomes. Two different enzyme constructs are used: a single strand lipidated DNA inserted into the membrane that enables binding of an enzyme conjugated to a complementary DNA strand (DNA-APEX2) or a glycan-targeting binding group conjugated to horseradish peroxidase (WGA-HRP). Both tethered enzymes label proteins on the cell surface using a biotin substrate via a radical mechanism. The method provides significantly enhanced labeling efficiency and is much faster than traditional chemical labeling methods and methods that employ soluble enzymes. The authors comprehensively analyze the labeled proteins using mass spectrometry and find multiple proteins that were previously undetectable with chemical methods and soluble enzymes. Furthermore, they compare the labeling of both cells and the exosomes that are formed from the cells and characterize both up- and down-regulated proteins related to cancer development that may provide a mechanistic underpinning.

      Overall, the method is novel and should enable the discovery of many low-abundance cell-surface proteins through more efficient labeling. The DNA-APEX2 method will only be accessible to more sophisticated laboratories that can carry out the protocols, but the WGA-HRP method employs a readily available commercial product and gives equivalent, perhaps even better, results. In addition, the method cannot discriminate between proteins that are genuinely expressed on the cell and those that are non-specifically bound to the cell surface.

      The authors describe the approach and identify two unique proteins on the surface of prostate cell lines.

      Strengths:

      Good introduction with appropriate citations of relevant literature. Much higher labeling efficiency and faster than chemical methods and soluble enzyme methods. Ability to detect low-abundance proteins, not accessible from previous labeling methods.

      Weaknesses: The DNA-APEX2 method requires specialized reagents and protocols that are much more challenging for a typical laboratory to carry out than conventional chemical labeling methods.

      The claims and findings are sound. The finding of novel proteins and the quantitative measurement of protein up- and down-regulation are important. The concern about non-specifically bound proteins could be addressed by looking at whether the detected proteins have a transmembrane region that would enable them to localize in the cell membrane.

      We thank the reviewer for recognizing the strengths and importance of this work. We also thank the reviewer for mentioning the issue of non-specifically bound proteins. As addressed above in the essential revisions section, we believe that any low affinity, non-specific binding proteins are likely removed in the multiple wash/centrifugation steps on cells or the multiple centrifugation steps and sucrose gradient purification on EVs. Given the likelihood of removal of non-specific binders, we believe that the secreted proteins identified are likely high affinity interactions and that their differential expression on either cells or EVs plays an important part in the downstream biology of both sample types. However, the previous data presentation did not clarify which proteins pertained to the transmembrane plasma membrane proteome versus secreted protein forms. For further clarity in the data presentation (Figure 3D, 4D, 5D), we have bolded proteins that are also found in the SURFY database, which only includes surface annotated proteins with a predicted transmembrane domain (Bausch-Fluck et al., The in silico human surfaceome. PNAS. 2018). We have also italicized proteins that are annotated to be secreted from the cell to the extracellular space (Uniprot classification). We have updated the text and captions as shown below:

      New Figure 3:

      Figure 3. WGA-HRP identifies a number of enriched markers on Myc-driven prostate cancer cells. (A) Overall scheme for biotin labeling, and label-free quantification (LFQ) by LC-MS/MS for RWPE-1 Control and Myc over-expression cells. (B) Microscopy image depicting morphological differences between RWPE-1 Control and RWPE-1 Myc cells after 3 days in culture. (C) Volcano plot depicting the LFQ comparison of RWPE-1 Control and Myc labeled cells. Red labels indicate upregulation in the RWPE-1 Control cells over Myc cells and green labels indicate upregulation in the RWPE-1 Myc cells over Control cells. All colored proteins are 2-fold enriched in either dataset across four replicates (two technical, two biological, p<0.05). (D) Heatmap of the 15 most upregulated transmembrane (bold) or secreted (italics) proteins in RWPE-1 Control and Myc cells. Scale indicates intensity, defined as (LFQ Area - Mean LFQ Area)/standard deviation. Extracellular proteins with annotated transmembrane domains are bolded and annotated secreted proteins are italicized. (E) Table indicating fold-change of most differentially regulated proteins by LC-MS/MS for RWPE-1 Control and Myc cells. (F) Upregulated proteins in RWPE-1 Myc cells (Myc, ANPEP, Vimentin, and FN1) are confirmed by western blot. (G) Upregulated surface proteins in RWPE-1 Myc cells (Vimentin, ANPEP, FN1) are detected by immunofluorescence microscopy. HLA-B, which is downregulated by Myc over-expression, was also detected by immunofluorescence microscopy. All western blot images and microscopy images are representative of two biological replicates. Mass spectrometry data is based on two biological and two technical replicates (N = 4).

      New Figure 4:

      Figure 4. WGA-HRP identifies a number of enriched markers on Myc-driven prostate cancer EVs. (A) Workflow for small EV isolation from cultured cells. (B) Labeled proteins indicating canonical exosome markers (ExoCarta Top 100 List) detected after performing label-free quantification (LFQ) from whole EV lysate. The proteins are graphed from least abundant to most abundant. (C) Workflow of exosome labeling and preparation for mass spectrometry. (D) Heatmap of the 15 most upregulated proteins in RWPE-1 Control or Myc EVs. Scale indicates intensity, defined as (LFQ Area - Mean LFQ Area)/SD. Extracellular proteins with annotated transmembrane domains are bolded and annotated secreted proteins are italicized. (E) Table indicating fold-change of most differentially regulated proteins by LC-MS/MS for RWPE-1 Control and Myc cells. (F) Upregulated proteins in RWPE-1 Myc EVs (ANPEP and FN1) are confirmed by western blot. Mass spectrometry data is based on two biological and two technical replicates (N = 4). Due to limited sample yield, one replicate was performed for the EV western blot.

      New Figure 5:

      Figure 5. WGA-HRP identifies a number of EV-specific markers that are present regardless of oncogene status. (A) Matrix depicting samples analyzed during LFQ comparison: Control and Myc cells, as well as Control and Myc EVs. (B) Principal component analysis (PCA) of all four groups analyzed by LFQ. Component 1 (50.4%) and component 2 (15.8%) are depicted. (C) Functional annotation clustering was performed using DAVID Bioinformatics Resource 6.8 to classify the major constituents of component 1 in PCA analysis. (D) Heatmap of the 25 most upregulated proteins in RWPE-1 cells or EVs. Proteins are listed in decreasing order of expression with the most highly expressed proteins in EVs on the far left and the most highly expressed proteins in cells on the far right. Scale indicates intensity, defined as (LFQ Area - Mean LFQ Area)/SD. Extracellular proteins with annotated transmembrane domains are bolded and annotated secreted proteins are italicized. (E) Table indicating fold-change of most differentially regulated proteins by LC-MS/MS for RWPE-1 EVs compared to parent cells. (F) Western blot showing the EV-specific markers ITIH4, IGSF8, and MFGE8. Mass spectrometry data is based on two biological and two technical replicates (N = 4). Due to limited sample yield, one replicate was performed for the EV western blot.
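      The heatmap scale used in the legends above, (LFQ Area - Mean LFQ Area)/SD, is a per-protein z-score across samples. A minimal sketch follows, assuming the mean and the sample standard deviation are taken across the samples shown for each protein; the legends do not state whether areas are log-transformed first or which SD estimator is used, and the function name is hypothetical.

```python
import statistics


def zscore_rows(lfq_areas):
    """Map protein -> LFQ areas across samples to protein -> z-scores.

    Each value becomes (area - mean area) / SD for that protein, so rows are
    comparable on one colour scale regardless of absolute abundance.
    """
    scaled = {}
    for protein, areas in lfq_areas.items():
        mu = statistics.mean(areas)
        sd = statistics.stdev(areas)  # sample SD; needs at least 2 samples
        if sd == 0:
            scaled[protein] = [0.0] * len(areas)  # constant row: no contrast to show
        else:
            scaled[protein] = [(a - mu) / sd for a in areas]
    return scaled
```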

      Authors mention time-sensitive changes but it is unclear how this method would enable one to obtain this kind of data. How would this be accomplished? The statement "Due to the rapid nature of peroxidase enzymes (1-2 min), our approaches enable kinetic experiments to capture rapid changes, such as binding, internalization, and shuttling events." Yes, it is faster, but not sure I can think of an experiment that would enable one to capture such events.

      We thank the reviewer for this comment and giving us an opportunity to elaborate on the types of experiments enabled by this new method. A previous study (Y, Li et al. Rapid Enzyme-Mediated Biotinylation for Cell Surface Proteome Profiling. Anal. Chem. 2021) showed that labeling the cell surface with soluble HRP allowed the researchers to detect immediate surface protein changes in response to insulin treatment. They demonstrated differential surfaceome profiling changes at 5 minutes vs 2 hours following treatment with insulin. Only methods utilizing these rapid labeling enzymes could allow for this type of resolution. A few other biological settings that experience rapid cell surface changes are: response to drug treatment, T-cell activation and synapse formation (S, Valitutti, et al. The space and time frames of T cell activation at the immunological synapse. FEBS Letters. 2010) and GPCR activation (T, Gupte et al. Minute-scale persistence of a GPCR conformation state triggered by non-cognate G protein interactions primes signaling. Nat. Commun. 2019). We also believe the method would be useful for post-translational processes where proteins are rapidly shuttling to the cell surface. We have updated the discussion to elaborate on these types of experiments.

      "Due to the fast kinetics of peroxidase enzymes (1-2 min), our approaches could enable kinetic experiments to capture rapid post-translational trafficking of surface proteins, such as response to insulin, certain drug treatments, T-cell activation and synapse formation, and GPCR activation."

      The authors do not have any way to differentiate between proteins expressed by cells and presented on their membranes from proteins that non-specifically bind to the membrane surface. Non-specific binding (NSB) is not addressed. Proteins can non-specifically bind to the cell or EV surface. The results are obtained by comparisons (cells vs exosomes, controls vs cancer cells), which is fine because it means that what is being measured is differentially expressed, so even NSB proteins may be up- and down-regulated. But the proteins identified need to be confirmed. For example, are all the proteins being detected transmembrane proteins that are known to be associated with the membrane?

      As mentioned above, we utilized the most rigorous informatics analysis available (Uniprot and SURFY) to annotate the proteins we find as having a signal sequence and/or TM domain. Data shown in heatmaps are based on significance (p < 0.05) across all four replicates, which supports that any secreted proteins present are likely due to actual biological differences between oncogenic status and/or sample origin (i.e. EV vs cell). We have addressed this point in a previous comment above.

      The term "extracellular vesicles" (EVs) might be more appropriate than "exosomes" to describe the studied preparation.

      As we describe above in response to earlier comments, we have systematically changed from using exosomes to small extracellular vesicles and better defined the isolation procedure that we used in the methods section.

      Reviewer #3 (Public Review):

      The article by Kirkemo et al explores approaches to analyse the surface proteome of cells or cell-derived extracellular vesicles (EVs, called here exosomes, but the more generic term "extracellular vesicles" would be more appropriate because the used procedure leads to co-isolation of vesicles of different origin), using tools to tether proximity-biotinylation enzymes to membranes. The authors determine the best conditions for surface labeling of cells, and demonstrate that tethering the enzymes (APEX or HRP) increases the number of proteins detected by mass-spectrometry. They further use one of the two approaches (where HRP binds to glycans), to analyse the biotinylated proteome of two variants of a prostate cancer cell line, and the corresponding EVs. The approaches are interesting, but their benefit for analysis of cells or EVs is not very strongly supported by the data.

      First, the authors honestly show (fig2-suppl figures) that only 35% of the proteins identified after biotinylation with their preferred tool actually correspond to annotated surface proteins. This is only slightly better than results obtained with a non-tethered sulfo-NHS-approach (30%).

      We thank the reviewer for this comment. The reason we utilize membrane protein enrichment methods is that membrane protein abundance is low compared to cytosolic proteins and their identification can be overwhelmed by cytosolic contaminants. Nonetheless, despite our best efforts to limit labeling to the membrane proteins, cytosolic proteins can carry over. Thus, we utilize informatics methods to identify the proteins that are annotated to be membrane associated. The Uniprot GOCC (Gene Ontology Cellular Component) Plasma Membrane database is the most inclusive of membrane proteins, only requiring that they contain either a signal sequence, transmembrane domain, GPI anchor or other membrane associated motifs, yielding a total of 5,746 proteins. This will include organelle membrane proteins. It is known that proteins can traffic from the internal organelles to the cell surface, so these can be bona fide cell surface proteins too. To increase the informatics stringency for membrane proteins, we have now applied a new database aggregated from work by the Wollscheid lab, called SURFY (Bausch-Fluck et al., The in silico human surfaceome. PNAS. 2018). This is a machine learning method trained on 735 high confidence membrane proteins from the Cell Surface Protein Atlas (CSPA). SURFY predicts a total of 2,886 cell surface proteins. When we filter our data using SURFY for proteins, peptides and label free quantitation (LFQ) area for three methods, we find that the difference between NHS-Biotin and WGA-HRP expands considerably (see new Figure 3-Supplemental Figure 1 below). We observe these differences when the datasets are searched with either the GOCC Plasma Membrane database or the entire human Uniprot database. The difference is especially large for LFQ analysis, which quantitatively scores peptide intensity as opposed to simply counting the number of hits as for protein and peptide analysis.
      Cytosolic carry-over is the major disadvantage of NHS-Biotin, which suppresses signal strength and is reflected in the lower LFQ values (24% for NHS-biotin compared to 40% for WGA-HRP). We have updated the main text and supplemental figure below:

      "Both WGA-HRP and biocytin hydrazide had similar levels of cell surface enrichment on the peptide and protein level when cross-referenced with the SURFY curated database for extracellular surface proteins with a predicted transmembrane domain (Figure 3 - Figure supplement 1A). Sulfo-NHS-LC-LC-biotin and whole cell lysis returned the lowest percentage of cell surface enrichment, suggesting a larger portion of the total sulfo-NHS-LC-LC-biotin protein identifications were of intracellular origin, despite the use of the cell-impermeable format. These same enrichment levels were seen when the datasets were searched with the curated GOCC-PM database, as well as the Uniprot entire human proteome database (Figure 3 - Figure supplement 1B). Importantly, of the proteins quantified across all four conditions, biocytin hydrazide and WGA-HRP returned higher overall intensity values for SURFY-specified proteins than either sulfo-NHS-LC-LC-biotin or whole cell lysis. Notably, although biocytin hydrazide shows slightly higher cell surface enrichment compared to WGA-HRP, we were unable to perform the comparative analysis at 500,000 cells (instead requiring 1.5 million), as the protocol yielded too few cells for analysis."

      Figure 3-Figure Supplement 1. Comparison of surface enrichment between replicates for different mass spectrometry methods. (A) The top three methods (NHS-Biotin, Biocytin Hydrazide, and WGA-HRP) were compared for their ability to enrich cell surface proteins on 1.5 M RWPE-1 Control cells by LC-MS/MS after being searched with the Uniprot GOCC Plasma Membrane database. Shown are enrichment levels on the protein, peptide, and average MS1 intensity of top three peptides (LFQ area) levels. (B) The top three methods (NHS-Biotin, Biocytin Hydrazide, and WGA-HRP) were compared for their ability to enrich cell surface proteins on 1.5 M RWPE-1 Control cells by LC-MS/MS after being searched with the entire human Uniprot database. Shown are enrichment levels on the protein, peptide, and average MS1 intensity of top three peptides (LFQ area) levels. Proteins or peptides detected from cell surface annotated proteins (determined by the SURFY database) were divided by the total number of proteins or peptides detected. LFQ areas corresponding to cell surface annotated proteins (SURFY) were divided by the total area sum intensity for each sample. The corresponding percentages for two biological replicates were plotted.
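      The percentages in this supplement follow the ratio described in the legend: surface-annotated identifications (or LFQ area) divided by the total for that sample. A minimal sketch, assuming protein-level inputs and a set of SURFY-annotated identifiers; the function name is hypothetical, and the protein names and values in the check are placeholders, not data from this study.

```python
def surface_enrichment(detected, lfq_area, surface_set):
    """Return (% of identifications, % of LFQ area) from surface-annotated proteins.

    detected    -- identified proteins (or one entry per peptide, labeled by its protein)
    lfq_area    -- dict protein -> LFQ area for the same sample
    surface_set -- identifiers annotated as cell surface (e.g. from SURFY)
    """
    n_surface = sum(1 for p in detected if p in surface_set)
    pct_ids = 100.0 * n_surface / len(detected)
    total_area = sum(lfq_area.values())
    surface_area = sum(a for p, a in lfq_area.items() if p in surface_set)
    pct_area = 100.0 * surface_area / total_area
    return pct_ids, pct_area
```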

      There are additional advantages to WGA-HRP over NHS-biotin. These include: (i) labeling time is 2 min versus 30 min, which would afford higher kinetic resolution as needed, and (ii) NHS-biotin labels lysines, which hinders tryptic cleavage and downstream peptide analysis, whereas WGA-HRP labels tyrosines, leaving tryptic patterns unaffected. WGA-HRP is slightly below biocytin hydrazide in peptide and protein IDs and somewhat further below by LFQ. However, it has significant advantages over biocytin hydrazide: (i) the sample size for WGA-HRP can be reduced by a factor of 3-5, because the hydrazide protocol loses cells during the multiple washing steps after periodate oxidation and hydrazide labeling, (ii) the labeling time is dramatically reduced from 3 hr for hydrazide to 2 min for WGA-HRP, and (iii) the HRP enzyme has a large labeling diameter (20-40 nm, but also reported up to 200 nm) and can label non-glycosylated membrane proteins, whereas biocytin hydrazide only labels glycosylated proteins. The hydrazide method is the current standard for membrane protein enrichment, and we feel that WGA-HRP will be competitive, especially when cell sample size is limited or requires special handling. In the case of EVs, we were not able to perform hydrazide labeling due to the two-step process and small sample size.

      Indeed the list of identified proteins in figures 4 and 5 include several proteins whose expected subcellular location is internal, not surface exposed, and whose location in EVs should also be inside (non-exhaustively: SDCBP = syntenin, PDCD6IP = Alix, ARRDC1, VPS37B, NUP35 = nucleopore protein)…

      We thank the reviewer for this comment. We have elaborated on this point in a number of response paragraphs above. The proteins that the reviewer points out are annotated as “plasma membrane” in the very inclusive GOCC plasma membrane database. However, this means that they may also spend time in other locations in the cell or reside on organelle membranes. We have done further analysis to remove any intracellular membrane residing proteins that are included in the GOCC plasma membrane database, including the five proteins mentioned above. We have also further highlighted proteins that appear in the SURFY database, as discussed above and in our response to Reviewer 2’s comment. To increase stringency, we have bolded proteins that are found in the more selective SURFY database and italicized secreted proteins. With this new analysis and data presentation, it is now clearer which markers are bona fide extracellular resident membrane proteins. We have updated the Figures and Figure legends as mentioned above, and added this statement in the Data Processing and Analysis methods:

      "Additionally, to not miss any key surface markers such as secreted proteins or anchored proteins without a transmembrane domain, we chose to initially avoid searching with a more stringent protein list, such as the curated SURFY database. However, following the analysis, we bolded proteins found in the SURFY database and italicized proteins known to be secreted (Uniprot)."

      The membrane proteins identified as different between the control and Myc-overexpressing cells or their EVs, would have been identified as well by a regular proteomic analysis.

      To directly compare surfaceomes of EVs to cells, we are compelled to use the same proteomic method. For parental cell surfaceomic analysis, a membrane enrichment method is required due to the high levels of cytosolic proteins that swamp out signal from membrane proteins. Although EVs have a higher proportion of membrane to cytosol, whole EV proteomics would still have significant cytosolic contamination.

      Second, the title highlights the benefit of the technique for small-scale samples: this is demonstrated for cells (figures 1-2), but not for EVs: no clear quantitative indication of the amount of material used is provided for EV samples. Furthermore, no comparison with other biotinylation techniques such as sulfo-NHS is provided for EVs/exosomes. Therefore, it is difficult to infer the benefit of this technique applied to the analysis of EVs/exosomes.

      We appreciate the reviewer for this comment. We have updated the methods as mentioned above in our response to the Essential Revisions. In brief, the yield of EVs after sucrose gradient isolation was 3-5 µg of protein from sixteen 15 cm plates of cells, totaling 240 mL of media. Since we had previously demonstrated that our method was superior to sulfo-NHS for enriching surface proteins on cells, we proceeded to use WGA-HRP for the EV labeling experiments.

      In addition, the WGA-based tethering approach, which is the only one used for the comparative analysis of figures 4 and 5, possibly induces a bias towards identification of proteins with a particular glycan signature: a novelty would possibly have come from a comparison of this approach with the other initially evaluated, the DNA-APEX one, where tethering is induced by lipid moieties, thus should not depend on glycans. The authors may have then identified by LC-MS/MS specific glycan-associated versus non-glycan-associated proteins in the cells or EVs membranes. Also ideally, the authors should have compared the 4 combinations of the 2 enzymes (APEX and HRP) and 2 tethers (lipid-bound DNA and WGA) to identify the bias introduced by each one.

      We thank the reviewer for this comment. We performed analysis to determine whether there was a bias towards Uniprot annotated “Glyco” vs “Non-Glyco” surface proteins within the SURFY database identified across the WGA-HRP, APEX2-DNA, APEX2, and HRP labeling methods. We performed this analysis by measuring the total LFQ area detected for each category (glycoprotein vs non-glycoprotein) and dividing that by the total LFQ area found across all proteins detected in the sample. We found similar normalized areas of non-glyco surface proteins between WGA-HRP and APEX2-DNA, suggesting there is not a bias against non-glycosylated proteins in the WGA-HRP sample. There were slightly elevated levels of glycoproteins in the WGA-HRP sample over APEX2-DNA. It is not surprising to us that there is little bias, because the free radicals generated by biotin-tyramide can label over tens of nanometers and thus can label not just the protein they are attached to, but also its neighbors, regardless of glycosylation status. We have added this as Figure 2-Supplement 3, and amended the text in the manuscript below in purple.

      Figure 2 – Figure Supplement 3: Comparison of enrichment of Glyco- vs Non-Glyco-proteins. (A) TIC area of Uniprot annotated Glycoproteins compared to Non-Glycoproteins in the SURFY database for each labeling method compared to total TIC area. There was not a significant difference in detection of Non-Glycoproteins between WGA-HRP and APEX2-DNA and only a slightly higher detection of Glycoproteins in the WGA-HRP sample over APEX2-DNA.

      "As the mode of tethering WGA-HRP involves GlcNAc and sialic acid glycans, we wanted to determine whether there was a bias towards Uniprot annotated 'Glycoprotein' vs 'Non-Glycoprotein' surface proteins identified across the WGA-HRP, APEX2-DNA, APEX2, and HRP labeling methods. We specifically looked at surface proteins found in the SURFY database, which is the most restrictive surface database and requires that proteins have a predicted transmembrane domain (Bausch-Fluck et al., The in silico human surfaceome. PNAS. 2018). We performed this analysis by measuring the average MS1 intensity across the top three peptides (area) for SURFY glycoproteins and non-glycoproteins for each sample and dividing that by the total LFQ area found across all GOCC annotated membrane proteins detected in each sample. We found similar normalized areas of non-glyco surface proteins across all samples (Figure 2 - Figure supplement 4). If a bias existed towards glycosylated proteins in WGA-HRP compared to the glycan-agnostic APEX2-DNA sample, then we would have seen a larger percentage of non-glycosylated surface proteins identified in APEX2-DNA over WGA-HRP. Due to the large labeling radius of the HRP enzyme, we find it unsurprising that the WGA-HRP method is able to capture non-glycosylated proteins on the surface to the same degree (Rees et al. Selective Proteomic Proximity Labeling Assay SPPLAT. Current Protocols in Protein Science. 2015). There is a slight increase in the area percentage of glycoproteins detected in the WGA-HRP compared to the APEX2-DNA sample, but this is likely due to the fact that a greater number of surface proteins in general are detected with WGA-HRP."
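      The normalization described in this paragraph can be written as a short sketch: SURFY glycoprotein and non-glycoprotein areas, each divided by the summed area of all GOCC plasma-membrane proteins in the same sample. The function and argument names are hypothetical, and the values in the check are placeholders rather than data from this study.

```python
def glyco_fractions(area, surfy, glyco, gocc_pm):
    """Return (% glycoprotein area, % non-glycoprotein area) for one sample.

    area    -- dict protein -> intensity (e.g. mean MS1 area of top 3 peptides)
    surfy   -- proteins in the SURFY surfaceome
    glyco   -- proteins annotated as glycoproteins in Uniprot
    gocc_pm -- proteins annotated to GOCC plasma membrane (the denominator)
    """
    denom = sum(a for p, a in area.items() if p in gocc_pm)
    glyco_area = sum(a for p, a in area.items() if p in surfy and p in glyco)
    non_glyco_area = sum(a for p, a in area.items() if p in surfy and p not in glyco)
    return 100.0 * glyco_area / denom, 100.0 * non_glyco_area / denom
```

      Comparing the second value between WGA-HRP and APEX2-DNA is the test for a bias against non-glycosylated proteins described above.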

      As presented the article is thus an interesting technical description, which does not convince the reader of its benefit to use for further proteomic analyses of EVs or cells. Such info is of course interesting to share with other scientists as a sort of "negative" or "neutral" result. Maybe a novelty of the presented work is the differential proteome analysis of surface enriched EV/cell proteins in control versus myc-expressing cells. Such analyses of EVs from different derivatives of a tumor cell line have been performed before, for instance comparing cells with different K-Ras mutations (Demory-Beckler, Mol Cell proteomics 2013 # 23161513). However, here the authors compare also cells and EVs, and find possibly interesting discrepancies in the upregulated proteins. These results could probably be exploited more extensively. For instance, authors could give clearer info (lists) on the proteins differentially regulated in the different comparisons: in EVs from both cells, in EVs vs cells, in both cells.

      We appreciate the reviewer for this critique and have updated the manuscript accordingly. We have changed the title to “Cell surface tethered promiscuous biotinylators enable small-scale comparative surface proteomic analysis of human extracellular vesicles and cells” to more accurately depict the focus of our manuscript which, as the reviewer highlighted, is that this technology allows for comparative analysis between the surfaceomes of cells vs EVs. We appreciate the fine work from the Coffey lab on whole EV analysis of KRAS transformed cells. They identified a mix of surface and cytosolic proteins that change in EVs from the transformed cells, whereas our data focuses specifically on the surfaceome differences in Myc transformed and non-transformed cells and corresponding small EVs. We believe this makes important contributions to the field as well.

      To further address the reviewer’s suggestions, we additionally have significantly reorganized the figures to better display the differentially regulated proteins. We have removed the volcano plots and instead included heatmaps with the top 30 (Figure 3 and Figure 4) and top 50 (Figure 5) differentially regulated proteins across cells and EVs. We have also updated the lists of proteins in the supplemental source tables section. See responses to Reviewer 2 above for the updates to Figures 3-5. We have additionally included supplemental figures with lists of differentially upregulated proteins in the EV and Cell samples, which are shown below:

      Figure 3 – Supplement 3: List of proteins comparing enriched targets (>2-fold) in Myc cells versus Control cells. Targets that were found enriched (Myc/Control) in the Control cells (left) and Myc cells (right). The fold-change between Myc cells and Control cells is listed in the column to the right of the gene name.

      Figure 4 – Supplement 1: List of proteins comparing enriched targets (>1.5-fold) in Myc EVs versus Control EVs. Targets that were found enriched (Myc/Control) in the Control EVs (left) and Myc EVs (right). The fold-change between Myc EVs and Control EVs is listed in the column to the right of the gene name.

      Figure 4 – Figure Supplement 2: Venn diagram comparing enriched targets (>2-fold) in Cells and EVs. (A) Targets that were found enriched in the Control EVs (purple) and Control cells (blue) when each is separately compared to Myc EVs and Myc cells, respectively. The 5 overlapping enriched targets in common between Control cells and Control EVs are listed in the center. (B) Targets that were found enriched in the Myc EVs (purple) and Myc cells (blue) when each is separately compared to Control EVs and Control cells, respectively. The 12 overlapping enriched targets in common between Myc cells and Myc EVs are listed in the center.

      Figure 5 - Supplement 1: List of proteins comparing enriched targets (>2-fold) in Control EVs versus Control cells and Myc EVs versus Myc cells. (A) Targets that were found enriched (EV/cell) in the Control samples are listed. The fold-change values between Control EVs and Control cells are listed in the column to the right of the gene name. (B) Targets that were found enriched (EV/cell) in the Myc samples are listed. The fold-change values between Myc EVs and Myc cells are listed in the column to the right of the gene name.

    1. Author Response

      Reviewer #2 (Public Review):

      The work proposes a new computational rule for classifying synaptic plasticity outcomes based on the geometry of synaptic enzyme dynamics. Specifically, the authors implement a multi-timescale model of hippocampal synaptic plasticity induction that takes into account the dynamics of the membrane potential, calcium concentration as well as CaMKII and calcineurin signalling pathways. They show that the proposed rule could be applied to reproduce the outcomes from nine published experimental studies involving different spike-timing and frequency-dependent plasticity induction protocols, animal ages, and experimental conditions. The model has also been used to generate predictions regarding the effect of spike-timing irregularity on plasticity outcomes. The proposed approach constitutes an interesting and original idea that contributes to the ongoing effort in discovering the rules of synaptic plasticity.

      The conclusions of this paper are mostly well supported by data, but some model assumptions and interpretation of modelling results need to be clarified and extended.

      1) The proposed model captures well the stochastic nature of the dendritic spine ion channels and receptors except for the calcium-sensitive potassium (SK) channel that has been modelled deterministically. Given that the same justification in terms of small number of channels present in the small dendritic spine compartment applies to the SK channels as well as to the voltage gated calcium channels and the AMPA and NMDA receptors, it is not clear why the authors have chosen a deterministic representation in the case of SK. The implications of this assumption need to be investigated and discussed.

      There are several stochastic models of AMPA and NMDA receptors based on single-channel recordings. Additionally, we had enough experimental data from single-channel recordings to build a custom Markov chain model of VGCCs. For the SK channel, we could not find enough experimental data (age-dependent activity, temperature sensitivity, etc.) to custom-build a stochastic model, so we decided to implement a deterministic model. Yet, we understand the reviewer’s point that, in theory, a stochastic model of SK channels could affect our results. We therefore now provide simulations with a stochastic model of SK channels and compare them with the deterministic model implemented in the study.

      We describe a minimal version of a stochastic model of SK channels compatible with the deterministic version. The deterministic SK channel model, fitted at ~35 °C, is described in the Methods section.

      Because of the factor ρ_SK in the equation, which multiplies r(Ca) by ~2, the equation cannot be related to a two-state Markov chain (MC). This would probably be possible with a three-state MC, but we used a different strategy. Noting that ρ_SK ∼ 2, we introduce a new equation

      As 0 < r(Ca) < 1, it is straightforward to introduce a two-state MC for which the above equation describes the probability of the open state. We then simulate two such channels, independent for a given Ca concentration, and approximate m_SK as the sum of the open states of the two channels (which belongs to [0, 2N_SK]).
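      The two-channel construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each two-state (closed/open) gate is given an opening rate r/τ and a closing rate (1 - r)/τ, so that its open probability relaxes to r(Ca) with time constant τ. The time constant `tau`, the population size `n_sk`, the time step `dt`, and the constant r used in the example are hypothetical placeholders. Each "channel" of the text is represented here as a population of `n_sk` independent gates; the summed open count lies in [0, 2·n_sk], and dividing it by `n_sk` gives a stochastic variable whose mean approaches 2·r(Ca), matching the factor of ~2 in the deterministic model.

```python
import numpy as np


def simulate_stochastic_sk(r_trace, n_sk=10, tau=1.0, dt=0.01, seed=None):
    """Summed open count of two independent populations of two-state gates.

    Assumed kinetics (hypothetical): closed -> open at rate r/tau and
    open -> closed at rate (1 - r)/tau, so each gate's open probability
    obeys dp/dt = (r - p) / tau on average and settles at r(Ca).
    Returns an integer array with values in [0, 2 * n_sk].
    """
    rng = np.random.default_rng(seed)
    n_open = np.zeros(2, dtype=np.int64)  # open gates in each of the two populations
    m = np.empty(len(r_trace), dtype=np.int64)
    for k, r in enumerate(r_trace):
        # Per-step transition probabilities; dt must keep both well below 1.
        p_open = (r / tau) * dt
        p_close = ((1.0 - r) / tau) * dt
        # Both draws use the state at the start of the step, so n_open stays in [0, n_sk].
        opened = rng.binomial(n_sk - n_open, p_open)
        closed = rng.binomial(n_open, p_close)
        n_open += opened - closed
        m[k] = n_open.sum()
    return m


# Example with a constant (hypothetical) r(Ca) = 0.3.
m = simulate_stochastic_sk(np.full(50_000, 0.3), n_sk=10, seed=0)
print(m[5_000:].mean() / 10)  # should be close to 2 * 0.3
```

      The per-step binomial update is only accurate when the transition probabilities per step are small; an exact alternative would be an event-driven (Gillespie-style) simulation of the same two-state kinetics.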

      As the reviewer can see in the figure below, we do not find a major difference in the simulations of the three protocols tested. Thus, we argue that adding a stochastic version of the SK channels in our current study would not fundamentally alter our main conclusions.

      Figure legend: Comparison between the model with the deterministic SK channel (original model, blue) and the modified model including the stochastic SK channel (stochastic SK, red) for the 1Pre2Post10 and 1Pre2Post50 protocols from Tigaret et al. 2016 and the 900 pulses at 50 Hz protocol from Dudek and Bear 1992 (100 repetitions). Using a deterministic vs. stochastic SK channel does not significantly modify the model’s behaviour.

      To explain our rationale for using a deterministic version of the SK channel, we added the following sentence to the Methods when describing the SK channel model: “Due to a lack of single-channel recordings of SK channels, and a lack of published stochastic models of SK channels, we modelled SK channels deterministically. In tests we found that this assumption had only a negligible impact on the outcomes of plasticity protocols (data not shown)” (page 40).

      2) Many of the model parameters have been set to values previously estimated from synaptic physiology and biochemistry experiments. However, a significant number of important parameter values have been tuned to reproduce the plasticity experiments targeted in this study. As such, it needs to be explained which of the plasticity outcomes have been reproduced because the parameters were chosen to do so. A clarification would have helped to substantiate the authors' conclusions.

      Most parameters were set with values previously defined by experimental work. We referred to these publications where necessary throughout the Methods and Tables in our original manuscript. For the few free parameters that were adjusted, we now provide additional information wherever necessary for the Tables concerned.

      ● In the legend of Table 4 (neuron electrical properties), we explain which parameters are different from values obtained from the literature to fit experimental data (Golding et al. 2001; Buchanan et al. 2007).

      ● Parameters for the sodium and potassium conductances (Table 5) are labelled as generic since they were intentionally set to produce the BaP dynamics we have shown in the paper.

      ● Table 6 has no free parameters.

      ● Table 7 caption now includes a description saying ‘Note that the buffer concentration, calcium diffusion coefficient, calcium diffusion time constant and calcium permeability were considered free parameters to adjust the calcium dynamics’.

      ● In Table 8 we had originally pointed out how we adapted the GluN2B rates from a published GluN2A model (Popescu et al. 2004; and Iacobucci and Popescu 2018). We now describe this adaptation in the Table 8 legend. In this Table, we now also better explain how we adjusted the NMDAr model to reflect the ratio between GluN2B and GluN2A, fitted from Sinclair et al. 2016, and the calcium-dependent NMDAr conductance, fitted from Maki and Popescu 2014.

      ● In Table 9 caption we now explain how the GABAr number and conductance were modified to fit GABAr currents as in Figures 15 b and e. The relevant parameters are indicated in the table.

      ● In Table 10 caption we now state the number of VGCCs per subtype that we used as a free parameter to reproduce the calcium dynamics (Figure 12).

      3) Adding experimental testing of model predictions, for example, that firing variability can alter the rules of plasticity, in the sense that it is possible to add noise to cause LTP for protocols that did not otherwise induce plasticity would be needed to increase confidence in the presented modelling results.

      We agree that it would be interesting in the future to test the many model predictions suggested in this work with biological experiments. This would however require a lot of work and will be the subject of further studies.

      Reviewer #3 (Public Review):

      This manuscript presents and analyzes a novel calcium-dependent model of synaptic plasticity combining both presynaptic and postsynaptic mechanisms, with the goal of reproducing a very broad set of available experimental studies of the induction of long-term potentiation (LTP) vs. long-term depression (LTD) in a single excitatory mammalian synapse in the hippocampus. The stated objective is to develop a model that is more comprehensive than the often-used simplified phenomenological models, but at the same time to avoid biochemical modeling of the complex molecular pathways involved in LTP and LTD, retaining only its most critical elements. The key part of this approach is the proposed "geometric readout" principle, which makes it possible to predict the induction of LTP vs. LTD by examining the concentration time course of the two enzymes known to be critical for this process, namely (1) the Ca2+/calmodulin-bound calcineurin phosphatase (CaN), and (2) the Ca2+/calmodulin-bound protein kinase (CaMKII). This "geometric readout" approach bypasses the modeling of downstream pathways, implicitly assuming that no further biochemical information is required to determine whether LTP or LTD (or no synaptic change) will arise from a given stimulation protocol. Therefore, it is assumed that the modeling of downstream biochemical targets of CaN and CaMKII can be avoided without sacrificing the predictive power of the model. Finally, the authors propose a simplified phenomenological Markov chain model to show that such "geometric readout" can be implemented mechanistically and dynamically, at least in principle.

      Importantly, the presented model has fully stochastic elements, including stochastic gating of all channels, stochastic neurotransmitter release and stochastic implementation of all biochemical reactions, which makes it possible to address the important question of the effect of intrinsic and external noise on the induction of LTP and LTD, which is studied in detail in this manuscript.

      Mathematically, this modeling approach resembles a continuous stochastic version of the "liquid computing" / "reservoir computing" approach: in this case the "hidden layer", or the reservoir, consists of the CaMKII and CaM concentration variables. In this approach, the parameters determining the dynamics of these intermediate ("hidden") variables are kept fixed (here, they are constrained by known biophysical studies), while the "readout" parameters are being trained to predict a target set of experimental observations.

      Strengths:

      1) This modeling effort is very ambitious in trying to match an extremely broad array of experimental studies of LTP/LTD induction, including the effect of several different pre- and post-synaptic spike sequence protocols, the effect of stimulation frequency, the sensitivity to extracellular Ca2+ and Mg2+ concentrations and temperature, the dependence of LTP/LTD induction on developmental state and age, and its noise dependence. The model is shown to match this large set of data quite well, in most cases.

      2) The choice of a stochastic implementation for all parts of the model makes it possible to fully explore the effects of intrinsic and extrinsic noise on the induction of LTP/LTD. This is very important and commendable, since regular noise-less spike firing induction protocols are not very realistic, and not very relevant physiologically.

      3) The modeling of the main players in the biochemical pathways involved in LTP/LTD, namely CaMKII and CaN, aims at sufficient biological realism, and as noted above, is fully stochastic, while other elements in the process are modeled phenomenologically to simplify the model and reveal more clearly the main mechanism underlying the LTP/LTD decision switch.

      4) There are several experimentally verifiable predictions that are proposed based on an in-depth analysis of the model behavior.

      We thank the reviewer for pointing out these strengths.

      Weaknesses:

      1) The stated explicit goal of this work is the construction of a model with an intermediate level of detail, as compared to simplified "one-dimensional" calcium-based phenomenological models on the one hand, and comprehensive biochemical pathway models on the other hand. However, the presented model comes across as extremely detailed nonetheless. Moreover, some of these details appear to be avoidable and not critical to this work. For instance, the treatment of presynaptic neurotransmitter release is both overly detailed and not sufficiently realistic: namely, the extracellular Ca2+ concentration directly affects vesicle release probability but has no effect on the presynaptic calcium concentration. I believe that the number of parameters and the complexity in the presynaptic model could be reduced without affecting the key features and findings of this work.

      This point is largely answered in Essential Revisions point 4, where we justify the choices we made for the presynaptic model. We acknowledge, however, that in this current version, we did not incorporate all biophysical components, such as the modulation of presynaptic calcium concentration with external calcium variations and multivesicular release. The calcium dependence of presynaptic release, as currently modeled, is however fitted in Figure 8e against data from Hardingham et al. 2006 and Tigaret et al. 2016. These current limitations could be addressed in a next version of our presynaptic model, in which we also plan to incorporate the influence of age and temperature.

      2) The main hypotheses and assumptions underlying this work need to be stated more explicitly, to clarify the main conclusions and goals of this modeling work. For instance, following much prior work, the presented model assumes that a compartment-based (not spatially-resolved) model of calcium-triggered processes is sufficient to reproduce all known properties of LTP and LTD induction and that neither spatially-resolved elements nor calcium-independent processes are required to predict the observed synaptic change. This could be stated more explicitly. It could also be clarified that the principal assumption underlying the proposed "geometric readout" mechanisms is that all information determining the induction of LTP vs. LTD is contained in the time-dependent spine-averaged Ca2+/calmodulin-bound CaN and CaMKII concentrations, and that no extra elements are required. Further, since both CaN and CaMKII concentrations are uniquely determined by the time course of postsynaptic Ca2+ concentration, the model implicitly assumes that the LTP/LTD induction depends solely on spine-averaged Ca2+ concentration time course, as in many prior simplified models. This should be stated explicitly to clarify the nature of the presented model.

      We thank the reviewer for the suggestions on how to clarify the main hypotheses and assumptions of our work. We slightly modified the sentences provided by the reviewer and added them to the main text (page 2, line 82, and page 19, line 593).

      3) In the Discussion, the authors appear to be very careful in framing their work as a conceptual new approach in modeling STD/STP, rather than a final definitive model: for instance, they explicitly discuss the possibility of extending the "geometric readout" approach to more than two time-dependent variables, and comment on the potential non-uniqueness of key model parameters. However, this makes it hard to judge whether the presented concrete predictions on LTP/LTD induction are simply intended as illustrations of the presented approach, or whether the authors strongly expect these predictions to hold. The level of confidence in the concrete model predictions should be clarified in the Discussion. If this confidence level is low, that would call into question the very goal of such a modeling approach.

      These are very good questions. Let us first comment on parameter uniqueness. We believe, as in E. Marder’s work on ion channel expression in neurons, that the synapse can adapt its internal parameters (protein numbers, transition rates, etc.) to produce a given functional behaviour. As a by-product, the parameters associated with a given behaviour are not unique. Additionally, since our model is able to reproduce 9 published experimental outcomes with a single set of parameters, it behaves as a functioning synapse whose adjusted parameters produce the expected behaviours. By extrapolation, our confidence in the further predictions is therefore high. We modified sentences in the Discussion section to argue this point (page 21, line 707).

      Let us now comment on increasing the complexity. To the best of our ability, we strove to design a plasticity readout that is as simple as possible yet still provides a functioning synapse. Given our success in reproducing 9 published experimental outcomes with a single set of parameters, adding more complexity would be akin to overfitting.

      4) The authors presented a simplified mechanistic dynamical Markov chain process to prove that the "geometric readout" step is implementable as a dynamical process, at least in principle. However, a more realistic biochemical implementation of the proposed "region indicator" variables may be complex and not guaranteed to be robust to noise. While the authors acknowledge and touch upon some of these issues in their discussion, it is important that the authors will prove in future work that the "geometric readout" is implementable as a biochemical reaction network. Barring such implementation, one must be extra careful when claiming advantages of this approach as compared to modeling work that attempts to reconstruct the entire biochemical pathways of LTP/LTD induction.

      We acknowledge this issue and agree this would be an interesting subject for future work.

    1. Author Response

      Reviewer #1 (Public Review):

      1) Comment: To determine the effect of diseased monocytes on retinal health, light-injured mouse retinas were injected with monocytes isolated from AMD patients (Figure 1 - figure supplement 1). This resulted in a reduction in photoreceptor number and ERG b-wave amplitude. However, the light-injured control eye was injected with PBS only, so no cells were present. The reasoning for using this control was not provided. The appropriate injection control would include monocytes isolated from non-AMD patients. This control should be performed side-by-side with cells from AMD patients.

      We thank the reviewer for this important comment. The purpose of the current study was to identify the macrophage subtype that may be associated with cell death in aAMD. We have previously reported that macrophages from AMD patients demonstrate a different phenotype compared with those from healthy individuals in the rodent model of laser-induced CNV (Hagbi-Levi S et al, 2016). Following the reviewer's comment, we have performed additional experiments to assess the effect of monocytes from healthy controls in the photic retinal injury model. The results showed that monocytes from AMD patients and from healthy individuals exert different effects on the retina in this rodent model of aAMD. Interestingly, we found that monocytes from healthy individuals were more neurotoxic to photoreceptors than monocytes from AMD patients. These results are included in the revised ms. as Figure 1- figure supplement 1H. A possible explanation for these findings is discussed in lines 179-190 of the revised manuscript. This finding reinforces the idea that using monocytes from AMD patients in these experiments is required to obtain a comprehensive understanding of their involvement in the progression of the disease.

      2) Comment: The authors hypothesize, from the experiments presented in Figure 1 - figure supplement 1, that the injected monocytes generated macrophages in the retina, which were responsible for the observed neurotoxicity (Lines 143-145). However, no direct evidence was presented. This idea should be tested in vivo. This could be done by injecting tracer-labeled human AMD-derived monocytes into light-injured mouse retinas. If the authors' hypothesis is true, collected retinas should contain tracer-labeled cells that express macrophage markers. Tracer-labeled M2a macrophage cells should be present since subsequent experiments identify this subclass as being associated with retinal cell death.

      Thank you for this important comment. To address the reviewer's comment, retinal sections from mice exposed to photic retinal injury and injected with DiO-tracer-labelled monocytes were stained with two M2a macrophage markers, CD206 (mannose receptor) and VEGF (Kadomoto, S et al, 2022; Jayasingam SD et al, 2019). Interestingly, we found co-localization of DiO-tracer staining (representing the injected human macrophages) with CD206 and VEGF markers in monocytes localized in different retinal layers, but not in monocytes remaining in the vitreous cavity. These data indicate that M2a markers are expressed during the polarization of monocytes into the M2a phenotype, which is maintained only upon entry into the retinal tissue. These results were included in Figure 1- figure supplement 1K-S and discussed in the revised manuscript in lines 179-182.

      3) Comment: Photoreceptor number and b-wave amplitudes were measured in light-injured retinas injected with one of four macrophage cell types generated from human AMD-derived monocytes. The authors conclude that only injection of M2a cells reduced photoreceptor number and b-wave amplitudes (Figure 1C, E). This may be true, but it is difficult for the reader to make a conclusion (especially in Fig. 1E) due to the large error bars and five different traces overlapping each other. To make these results easier to interpret, graph control cells with only one experimental sample (cell type) at a time.

      Thank you for this comment. Per the reviewer comment, the graphs were modified in the revised ms. (Figure 1, panel H-K).

      4) Comment: Most injected macrophages were located in the vitreous. In the case of M2a cells, the authors note that "several of the cells migrated across the retinal layers reaching the subretinal space" (Lines 167,168). One possible explanation for why M0, M1, and M2c macrophages did not induce retinal degeneration is that they did not migrate to the subretinal space and around the optic nerve head. Supplementary figures should be added to demonstrate that this is not the case.

      Thank you for this comment. To address the reviewer's comment, we compared the migration patterns of the different macrophage phenotypes following intravitreal injection in mice exposed to photic injury. Our results indicated that M0, M1 and M2c macrophages, similarly to M2a macrophages, migrated to the subretinal space and around the optic nerve. Thus, the neurotoxic effect of M2a macrophages is not explained by their capacity to infiltrate the retinal tissues. These results were included in Figure 1- figure supplement 2 E-H of the revised manuscript. These results are supported by our ex-vivo experiments, showing that co-culture of M2a macrophages with retinal explants was associated with increased photoreceptor cell death compared to M1 macrophages. The results are presented and discussed in the revised manuscript in lines 200-203.

      5) Comment: Figure 1 - figure supplement 2: Panel A, B cells were stained with CD206 to demonstrate the presence of M2a macrophages (panel B). The authors conclude that panel A contains M1 and panel B contains M2a cells. The lack of CD206 expression illustrates that panel A cells are not M2a macrophages but do not demonstrate they are M1 macrophages. A control using an M1 cell marker is necessary to show that panel A cells are M1 and M1 cells are not detected in M2a cultures.

      Thank you for this comment. We have validated the phenotype of each macrophage subtype by qPCR (Figure 1 panel A). To further address the reviewer's comment, we have performed additional immunocytochemistry for M1 macrophages using an anti-CD80 antibody, which is used as an M1 macrophage marker (Bertani FR et al. 2017). The results of the staining confirmed the identity of the M1 macrophages. These new results were included in Figure 1- figure supplement 2A, and are discussed in lines 168-170.

      6) Comment: Ex vivo, apoptotic photoreceptor and RPE cells are observed when cultured with M2a macrophages (Figure 2). Do injected M2a cells also induce apoptosis of RPE cells in vivo? This is important to establish that retinal explants are a good model for in vivo experiments.

      Thank you for this comment. To address the reviewer's comment, we assessed RPE apoptosis (using TUNEL, Caspase 3 staining and the RPE65 marker) after M2a cell delivery in the in-vivo photic injury model. We could not detect an apoptotic signal in the RPE layer 7 days after photic injury and therefore could not evaluate the effect of M2a macrophages on RPE cells in vivo (see Author response image 1). One possible explanation is that RPE cells that have undergone apoptosis are rapidly removed from the damaged tissue and, unlike photoreceptors, are no longer detectable. Furthermore, a study that investigated the impact of bright light on RPE cells in vivo showed that, although RPE cells underwent structural and chemical modifications after photic injury, no TUNEL signal was detected because RPE cells die by necrosis and not by apoptosis (Jaadane I et al, 2017). Other studies confirmed that blue light induces RPE necrosis (Song W et al, 2022; Mohamed A et al, 2022). Taken together, it seems that the ex-vivo retinal explant and the in-vivo photic injury model both simulate the mechanism of retinal cell death. However, the ex-vivo model allows us to establish the direct impact of M2a macrophages on the retina in a non-inflammatory context.

      Author response image 1.

      7) Comment: Reactive oxygen species (ROS) production was measured to determine if M2a cell-mediated neurotoxicity was due to oxidative stress. It is concluded that a ROS increase is partly responsible (Line 218). The data do not support this conclusion. ROS was detected in cultured M2a macrophages. More importantly, however, there was no increase in oxidative damage in vivo. The in vivo and cell culture results contradict each other so no conclusion can be made. The lack of in vivo confirmation weakens the argument that ROS drives M2a neurotoxicity. Text suggesting a role for ROS in neurotoxicity should be appropriately edited (Lines including 218, 244, 401,406,481).

      Thank you for this comment. The manuscript was revised according to the reviewer suggestion (Lines 250-256).

      8) Comment: The authors ask if the photoreceptor cell death is cytokine-mediated. Multiple cytokines were enriched in M2a-conditioned media. Of particular interest were CCR1 ligands MPIF1 and MCP4. The implication is that these two ligands mediate the M2a macrophages to photoreceptor cell death through CCR1. However, there is no attempt to show that either MPIF1 or MCP4 are present in vivo, or are sufficient to induce the retinal response observed. This could be demonstrated by injection of MPIF1 or MCP4. Evidence that either ligand phenocopies M2a macrophage injection would be direct evidence that CCR1 ligands activate the retinal response. Furthermore, co-injection with BX174 should block the effect of these ligands if they work through CCR1.

      Thank you for this comment. The identification of CCR1 ligand expression by M2a-polarized macrophages directed our decision to study CCR1 in the context of atrophic AMD. We do not claim that these specific CCR1 ligands are sufficient to activate CCR1 and cause retinal injury; the mechanism is likely more complex. Yet, to address the reviewer's comment, we have performed the experiments suggested by the reviewer. Mice were exposed to photic injury and immediately injected in one eye with MPIF1, MCP-4, or a combination of both, and in the other eye with PBS as vehicle. Intravitreal cytokine delivery was repeated two days later (based on the half-life of these cytokines), and ERGs were recorded two days after the last injection. Injection of cytokines at a dose of 300 ng per eye did not exacerbate photoreceptor death. The same experiment was then repeated with two higher cytokine doses, 1.2 µg/eye and 2 µg/eye, but no changes were observed between the cytokine-treated eyes and the vehicle-treated eyes. Based on previous studies reporting the physiological concentrations of different cytokines in the eyes of healthy and diseased individuals, and on experiments in which different cytokines were injected into rodent eyes (Estevao C et al, 2021; Zeng Y et al, 2019; Roybal CN et al, 2018; Mugisho OO et al, 2018), the cytokine doses used in our experiment are in the range in which an effect on the retina is expected.

      It is likely that a synergistic effect of M2a-secreted proteins in a particular microenvironment is necessary to increase the level of retinal damage (Bartee E et al, 2013). It is also likely that in the photic retinal injury model there is an upregulation of cytokines that may mask the additional delivery of exogenous cytokines. A comprehensive understanding of the complex interactions of these cytokines during retinal degeneration is beyond the scope of the current manuscript, which does not focus on identifying ligand-induced CCR1 activation and its consequences. Additionally, we suggest that, due to cytokine redundancy (Nicola NA; 1994), demonstrating that MPIF1 or MCP-4 can increase photoreceptor death is not required to prove CCR1 receptor involvement.

    1. Author Response:

      Reviewer #1:

      Charpentier et al. use facial recognition technology to show that mothers in a group of mandrills lead their offspring to associate with phenotypically similar offspring. Mandrills are a species of primate that live in large, matrilineal troops, with a single, dominant male that fathers the majority of the offspring. Male breeder turnover and extra-pair mating by females can lead to variation in relatedness between group members and the potential for kin-selected benefits from preferentially cooperating with closer relatives within the group. The authors argue that the strategy of influencing the social network of their offspring could be favoured by "second-order kin selection", a mechanism by which inclusive fitness benefits are accrued to female actors through kin-selected benefits to their offspring. This interpretation is supported by a theoretical model.

      The paper highlights a previously unappreciated mechanism for favouring association between non-kin in social groups and also contributes a nice insight into the complexity of social interactions in a relatively understudied wild primate species. The conclusions are strengthened by data showing associations between mothers were not influenced by the facial similarity of their offspring -- this suggests that mothers are making decisions based on the appearance of offspring and not their mothers.

      Some questions remain regarding the strength of the authors' interpretation. Given the challenges of studying mandrills in the field, the fact that the study reports data from a single group is understandable, but potential issues remain with the independence of data points. There may be an additional issue arising from the fact that this troop is semi-captive.

      The study group is not semi-captive. Instead, it originated from two release events of a few captive individuals into the wild (in 2002 and 2006). The population now comprises more than 250 individuals, all of whom, except for 7 founder females (<3%), were born in the wild. In addition, the study group is not fed and occasionally wanders into a fenced protected area. The park fences do not constitute a boundary for mandrills, and on most days (ca. 80%) the study group ranges outside the park. We have clarified this misunderstanding.

      Regarding the independence of data points, we would be grateful if this reviewer could clarify her/his thoughts. As a tentative response, we indeed have access to a single (albeit large) study group, but that is unfortunately often the case when studying primates or other large mammals. Regarding our study questions, we have clearly demonstrated increased nepotism among paternally related mandrills in two different social groups (Charpentier et al. 2007: semi-captive mandrills; Charpentier et al. 2020: wild mandrills). More generally, we see no parsimonious explanation for why the studied mandrills would behave differently from other wild mandrill groups, or would have experienced selective pressures that shaped their genetic structure and social organization differently.

      The number of genotyped offspring is relatively small (n = 15) and paternity is inferred from the identity of the dominant male. However, the authors also refer to the fact that it's normal for female mandrills to mate with several males during ovulation.

      Indeed, both sexes mate promiscuously during the mating season. We have very recently (June 2022) obtained new genetic profiles for a subset of the study infants (it took two years to obtain these data), increasing our sample of infants with a known father from 15 to 32. With these new data, we were able to distinguish four categories of infant-infant dyads: those sharing the same father (PHS), those not sharing the same father (not PHS) and, for dyads in which both infants have unknown fathers, those conceived during the same alpha male tenure and those that were not. The graph below shows the average facial distance between individuals for each of these four categories. Infants conceived during the same alpha male tenure are significantly more similar to each other than infants sired by different fathers or conceived during different alpha male tenures, but they are also significantly less similar to each other than infants sired by the same father (all four categories differ significantly from each other, except infants sired by different fathers versus those conceived during different alpha male tenures). As suggested by this reviewer, the fact that females mate predominantly with the alpha male, but to some extent also with other males, likely explains the difference between “same father” and “same alpha male tenure”. Importantly, however, considering all infants conceived during the same alpha male tenure as “PHS” is highly conservative. It is thus likely that knowing the paternity of every infant would produce even clearer effects (and indeed, increasing the data set from 15 to 32 infants strengthened this result). We have now updated this result (first model) based on this new sample.
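      The four dyad categories described above can be sketched as a small classification rule. This is a minimal illustration, assuming each infant record holds a father ID (or None if unknown) and the ID of the alpha male tenure during which it was conceived; the field names and IDs are hypothetical, and so is the handling of mixed dyads (one father known, one unknown), since the response does not say how such dyads were treated.

```python
from itertools import combinations


def classify_dyad(a, b):
    """Assign an infant-infant dyad to one of the four categories.

    Each infant is a dict with 'father' (an ID, or None if unknown) and
    'tenure' (ID of the alpha male tenure during which it was conceived).
    """
    if a["father"] is not None and b["father"] is not None:
        # Both fathers known: paternal half-sibs (PHS) or not
        return "PHS" if a["father"] == b["father"] else "not PHS"
    if a["father"] is None and b["father"] is None:
        # Both fathers unknown: fall back on alpha male tenure
        return "same tenure" if a["tenure"] == b["tenure"] else "different tenure"
    # Mixed dyad: handling not stated in the response, so this sketch
    # leaves it unclassified (hypothetical choice)
    return None


# Hypothetical example records
infants = {
    "i1": {"father": "M1", "tenure": "T1"},
    "i2": {"father": "M1", "tenure": "T1"},
    "i3": {"father": "M2", "tenure": "T1"},
    "i4": {"father": None, "tenure": "T2"},
    "i5": {"father": None, "tenure": "T2"},
    "i6": {"father": None, "tenure": "T3"},
}

categories = {
    (x, y): classify_dyad(infants[x], infants[y])
    for x, y in combinations(sorted(infants), 2)
}
```

      Note that i1 and i3 share a tenure but have different known fathers, so they fall under "not PHS" rather than "same tenure".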

      What evidence is there to support a beneficial effect of nepotism in this species?

      In mandrills, females who affiliate more (groom more or associate more) with their groupmates (kin or non-kin) during juvenility also reproduce 1 year earlier than females that are poorly socially integrated (Charpentier et al. 2012). These results are similar to what is known in many mammalian species (see Snyder-Mackler et al. 2020 for a review). The positive effects of a rich social life are generally triggered by all group members, not only close kin. However, if beneficial social relationships affect the direct fitness of individuals, as reported in mandrills and other species, then kin selection theory predicts that these effects should further translate into indirect fitness benefits.

      We have now added this relevant reference (Charpentier et al. 2012) in the revised version of our manuscript and present the results of this early study on mandrills.

      What form could nepotism take and does it necessarily have to involve full sibs?

      We are unsure why this reviewer mentions full sibs here. For this reviewer's information, of the 2,556 study dyads (model 1, on the impact of maternal and paternal origins on facial distance), only one dyad was a full-sib pair. Full sibs are therefore very rare in the study population, owing to male migration patterns and generally short alpha male tenures.

      If a female did not associate with offspring as shown here, would nepotistic interactions simply arise between her offspring and offspring that were less facially similar?

      We expect that facial similarity would then no longer predict spatial association. Indeed, we think that young mandrills do not use self-referent phenotype matching, which precludes them from recognizing infants that look like them. However, as stated below, we cannot fully exclude the possibility that other social partners, such as fathers, may also influence infant-infant relationships, although we think that this alternative mechanism is less parsimonious than the one we propose and test.

      Reviewer #2:

      This paper uses data on patterns of spatial association and facial similarity in mandrills to develop a new hypothesis for the evolution of kin recognition based on facial cues. Previous work on this system has shown that, among females, paternal half-sibs resemble each other visually more than maternal half-sisters do. The authors hypothesise that this paternally inherited facial similarity provides opportunities for kin selection, but it is unclear how offspring themselves could recognise kin using phenotype matching since they are unable to see their own face. One answer to this puzzle is that third parties -- mothers -- may promote social interactions between their own offspring and other offspring that resemble them since these other offspring are likely to share the same father. In support of this hypothesis, the authors find that mothers and offspring show spatial proximity to infants that are facially more similar than average. They also use an analytical evolutionary model to confirm the logic of this hypothesis. The model shows that mothers can gain inclusive fitness benefits by encouraging reciprocal social interaction among their offspring and other paternally-related offspring. They term this idea 'second-order' kin selection and identify a range of other circumstances in which it might play an important role in shaping the evolution of social behaviour.

      The main strengths of the paper are the interesting mandrill data and the cutting-edge methods used to analyse facial similarity, which have stimulated the development of a theoretically interesting hypothesis about the evolution of facially based kin recognition. The theoretical model enhances the generality and rigour of the work. The paper will be of wide interest and the concept of second-order kin selection may be applicable to other social circumstances, such as interactions among in-laws in close-knit family groups. Thus, I can see that this paper will be a stimulus for future work.

      We are grateful for these positive comments.

      The data are, I think, rather overinterpreted in terms of the degree to which they support the hypothesis. The spatial proximity data are interesting, but on their own, they are not definitive support for the hypothesis or model. A more critical approach to the hypothesis, clearly setting out the limitations of the data, and what tests in future could be used to falsify the hypothesis or model, would make for a stronger paper.

      We agree with this general comment and have addressed it by 1. adding a model on grooming relationships between females and infants, 2. toning down our interpretation throughout the manuscript, and 3. proposing future directions of research.

      Overall the authors have presented data that support a fascinating new mechanism by which natural selection can influence social interactions among the members of family groups, in potentially surprising ways. I also find it remarkable that 60 years after the development of the kin selection theory new implications of this theory are still being uncovered. The concept of second-order kin selection may prove important in understanding the evolution of social organisation and behaviour in species that live in groups containing a mixture of kin and non-kin, such as many primates and of course humans.

      We are grateful to this reviewer for this very positive comment. We fully agree that, 60 years after kin selection theory emerged, we are still discovering further implications!

      Reviewer #3:

      This is a very interesting and impressive manuscript. It is complex in its multiple components, and in some ways that makes it a difficult manuscript to evaluate. There is a lot in it, including empirical analyses of a face dataset and of behavioral association data, combined with a theoretical model.

      We are very grateful for this positive comment and are glad that you liked our manuscript.

      The three main findings are: 1) Paternal siblings look alike (similar to, and building on, a recent manuscript the authors published elsewhere); 2) Infants that are more facially similar tend to associate; and 3) mothers tend to be found in association with other unrelated infants that look more like their own infants. Such results are interesting, and indeed one potential interpretation, perhaps even the most likely, is that mothers are behaving in such a way that promotes association between their own infants and the paternal kin of their infants.

      Nonetheless, the evidence provided is logically only consistent with the authors' hypothesis, rather than being strong direct evidence for it. As such, the current framing and indeed the title, "Primate mothers promote proximity between their offspring and infants who look like them", are both problematic. (In addition, the title should be about mandrills, not "primates", since this manuscript does not provide evidence from any other species.) The evidence provided is consistent with the hypothesis, but also consistent with other potential hypotheses. The evidence given to dismiss other potential hypotheses is not strong, and rests on the fact that many males are not around all year to influence things, and that "males that were present during a given reproductive cycle are not responsible for maintaining proximity with either infants or their mothers (MJEC and BRT, pers. obs.)".

      We agree with this comment. After examining several alternative mechanisms in the light of the natural history of mandrills, we are confident that the proposed mechanism is at work in this species, although we cannot firmly exclude some of these alternative mechanisms. To address this comment, we have changed the title of our manuscript, which now reads “Mandrill mothers associate with infants who look like their own offspring using phenotype matching”. We have also included an additional model on grooming relationships (see response to R1) and have toned down the interpretation of our results throughout our revised manuscript. Finally, we have further discussed alternative scenarios, in particular the one involving fathers (see details above).

      My opinion is that these are really interesting analyses and data, which are being somewhat undermined by the insistence that only one hypothesis can explain the observed association patterns. It could easily be presented differently, as a demonstration that paternal siblings look alike and that they associate. The authors could then go on to explore different possible explanations for this using their association data, make the case that maternal behavior is the most plausible (but not the only) explanation, and present their model of how such behavior could bring fitness benefits.

      In my view, such a presentation would be both more cautious and more appropriate, without in any way reducing the impact or importance of the data. In the current iteration, I think there are issues because the data do not provide sufficient support for the surety of the title and conclusion, as presented.

      We think that the current organization of our manuscript was not that different from the one proposed here, and it follows the reasoning of a former manuscript (Charpentier et al. 2020). Indeed, we first remind the reader of what we already know from previous studies: paternal siblings look alike and they associate. We then explore different mechanisms. That being said, and as suggested, we have been more cautious in interpreting our results, which are indeed only correlative.

    1. Author Response

      Reviewer #1 (Public Review):

      In this work George et al. describe RatInABox, a software system for generating surrogate locomotion trajectories and neural data to simulate the effects of a rodent moving about an arena. This work is aimed at researchers that study rodent navigation and its neural machinery.

      Strengths:

      • The software contains several helpful features. It has the ability to import existing movement traces and interpolate data with lower sampling rates. It allows varying the degree to which rodents stay near the walls of the arena. It appears to be able to simulate place cells, grid cells, and some other features.

      • The architecture seems fine and the code is in a language that will be accessible to many labs.

      • There is convincing validation of velocity statistics. There are examples shown of position data, which seem to generally match between data and simulation.

      Weaknesses:

      • There is little analysis of position statistics. I am not sure this is needed, but the software might end up more powerful and the paper higher impact if some position analysis was done. Based on the traces shown, it seems possible that some additional parameters might be needed to simulate position/occupancy traces whose statistics match the data.

      Thank you for this suggestion. We have added a new panel to figure 2 showing a histogram of the time the agent spends at positions of increasing distance from the nearest wall. As you can see, RatInABox is a good fit to the real locomotion data: positions very near the wall are under-explored (in the real data this is probably because whiskers and physical body size block positions very close to the wall) and positions slightly farther from, but still close to, the wall are slightly over-explored (an effect known as thigmotaxis, already discussed in the manuscript).

      As you correctly suspected, fitting this required a new parameter which controls the strength of the wall repulsion; we call this “wall_repel_strength”. The motion model hasn’t changed mathematically: all we did was take a parameter which was originally a fixed constant (1), unavailable to the user, and make it a variable which can be changed (see methods section 6.1.3 for the maths). The curves fit best when wall_repel_strength ≈ 2. The methods and parameters table have been updated accordingly. See Fig. 2e.
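      A histogram of occupancy against distance to the nearest wall, like the one described for Fig. 2e, can be computed along these lines. This is a sketch under simplifying assumptions (a rectangular box with its corner at the origin, equal-width bins, hypothetical helper names), not the analysis code behind the figure. Uniformly sampled positions serve here only as a reference distribution; wall avoidance and thigmotaxis in real or simulated trajectories would show up as departures from it in the first few bins.

```python
import random


def distance_to_nearest_wall(x, y, width=1.0, height=1.0):
    """Distance from (x, y) to the closest wall of a box with corner at (0, 0)."""
    return min(x, width - x, y, height - y)


def wall_distance_histogram(positions, n_bins=5, width=1.0, height=1.0):
    """Fraction of position samples falling in each distance-to-wall bin."""
    max_d = min(width, height) / 2  # no point is farther than this from a wall
    counts = [0] * n_bins
    for x, y in positions:
        d = distance_to_nearest_wall(x, y, width, height)
        b = min(int(d / max_d * n_bins), n_bins - 1)  # clamp d == max_d
        counts[b] += 1
    return [c / len(positions) for c in counts]


# Reference distribution: uniform occupancy of a 1 x 1 box.
# Analytically the expected fractions are 0.36, 0.28, 0.20, 0.12, 0.04.
random.seed(0)
uniform = [(random.random(), random.random()) for _ in range(10_000)]
hist = wall_distance_histogram(uniform)
```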

      • The overall impact of this work is somewhat limited. It is not completely clear how many labs might use this, or have a need for it. The introduction could have provided more specificity about examples of past work that would have been better done with this tool.

      At the point of publication we, like yourself, didn’t know to what extent there would be a market for this toolkit; however, we were pleased to find that there was. In its first 11 months RatInABox has accumulated a growing, global user base, over 120 stars on GitHub and more than 17,000 downloads through PyPI. We have collected a list of testimonials[5] from users of the package vouching for its utility and ease of use, four of which are abridged below. These testimonials come from a diverse group of 9 researchers spanning 6 countries across 4 continents and career stages ranging from pre-doctoral researchers with little computational exposure to tenured PIs. Finally, the community does not only use RatInABox, it also builds it: at the time of writing RatInABox has received 20 GitHub “Issues” and 28 “pull requests” from external users (i.e. those who aren’t authors on this manuscript), ranging from small discussions and bug fixes to significant new features, demos and wrappers.

      Abridged testimonials:

      ● “As a medical graduate from Pakistan with little computational background…I found RatInABox to be a great learning and teaching tool, particularly for those who are underprivileged and new to computational neuroscience.” - Muhammad Kaleem, King Edward Medical University, Pakistan

      ● “RatInABox has been critical to the progress of my postdoctoral work. I believe it has the strong potential to become a cornerstone tool for realistic behavioural and neuronal modelling” - Dr. Colleen Gillon, Imperial College London, UK

      ● “As a student studying mathematics at the University of Ghana, I would recommend RatInABox to anyone looking to learn or teach concepts in computational neuroscience.” - Kojo Nketia, University of Ghana, Ghana

      ● “RatInABox has established a new foundation and common space for advances in cognitive mapping research.” - Dr. Quinn Lee, McGill, Canada

      The introduction continues to include the following sentence highlighting examples of past work which relied on generating artificial movement and/or neural data and which, by implication, could have been done better (or at least accelerated and standardised) using our toolbox.

      “Indeed, many past[13, 14, 15] and recent[16, 17, 18, 19, 6, 20, 21] models have relied on artificially generated movement trajectories and neural data.”

      • Presentation: Some discussion of case studies in Introduction might address the above point on impact. It would be useful to have more discussion of how general the software is, and why the current feature set was chosen. For example, how well does RatInABox deal with environments of arbitrary shape? T-mazes? It might help illustrate the tool's generality to move some of the examples in supplementary figure to main text - or just summarize them in a main text figure/panel.

      Thank you for this question. Since the initial submission of this manuscript RatInABox has been upgraded and environments have become substantially more “general”. Environments can now be of arbitrary shape (including T-mazes), boundaries can be curved, they can contain holes and can also contain objects (0-dimensional points which act as visual cues). A few examples are showcased in the updated figure 1 panel e.
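      One common way for a simulator to support arbitrarily shaped environments such as a T-maze is to store the boundary as a polygon and test candidate positions with a ray-casting point-in-polygon check. The sketch below illustrates that general technique only; it is an assumption for illustration, not a description of how RatInABox represents or checks its boundaries, and the T-maze coordinates are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies strictly inside the polygon.

    `polygon` is a list of (x, y) vertices in order; the last vertex
    connects back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:  # crossing lies on the ray towards +x
                inside = not inside
    return inside


# Hypothetical T-maze: a 0.2-wide vertical stem under a 1.0-wide top arm.
T_MAZE = [(0.4, 0.0), (0.6, 0.0), (0.6, 0.8), (1.0, 0.8),
          (1.0, 1.0), (0.0, 1.0), (0.0, 0.8), (0.4, 0.8)]
```

      For example, `point_in_polygon(0.5, 0.4, T_MAZE)` (in the stem) is `True`, while `point_in_polygon(0.1, 0.4, T_MAZE)` (beside the stem, outside the maze) is `False`.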

      To further illustrate the tool's generality beyond the structure of the environment, we continue to summarise the reinforcement learning example (Fig. 3e) and the neural decoding example in section 3.1. In addition, we have added three new panels to figure 3 highlighting new features which, we hope you will agree, make RatInABox significantly more powerful and general, and which address your suggestion of clarifying utility and generality directly in the manuscript.

      On the topic of generality, we wrote the manuscript to demonstrate the rich variety of ways in which RatInABox can be used without providing an exhaustive list of potential applications. For example, RatInABox can be used to study neural decoding and reinforcement learning, not because it was purpose-built with these use cases in mind, but because it contains a set of core tools designed to support spatial navigation and neural representations in general. For this reason we would rather keep the demonstrative examples as supplements, and we implement your suggestion by drawing further attention to the large array of tutorials and demos provided on the GitHub repository, modifying the final paragraph of section 3.1 to read:

      “Additional tutorials, not described here but available online, demonstrate how RatInABox can be used to model splitter cells, conjunctive grid cells, biologically plausible path integration, successor features, deep actor-critic RL, whisker cells and more. Despite including these examples we stress that they are not exhaustive. RatInABox provides the framework and primitive classes/functions from which highly advanced simulations such as these can be built.”

      Reviewer #3 (Public Review):

      George et al. present a convincing new Python toolbox that allows researchers to generate synthetic behavior and neural data specifically focusing on hippocampal functional cell types (place cells, grid cells, boundary vector cells, head direction cells). This is highly useful for theory-driven research where synthetic benchmarks should be used. Beyond just navigation, it can be highly useful for novel tool development that requires jointly modeling behavior and neural data. The code is well organized and written and it was easy for us to test.

      We have a few constructive points that they might want to consider.

      • Right now the code only supports X,Y movements, but Z is also critical and opens new questions in 3D coding of space (such as grid cells in bats, etc). Many animals effectively navigate in 2D, as a whole, but they certainly make a large number of 3D head movements, and modeling this will become increasingly important and the authors should consider how to support this.

      Agents now have a dedicated head direction variable (previously, head direction was simply assumed to be the normalised velocity vector). By default this smooths and normalises the velocity but, in principle, it could be accessed and used to model more complex head direction dynamics. This is described in the updated methods section.
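      The default behaviour described above (smooth the velocity, then normalise it) might look like the following. This is a minimal sketch assuming exponential smoothing with a hypothetical timescale `tau`; RatInABox's actual smoothing scheme and parameter names may differ.

```python
import math


def smooth_velocity(v_smooth, v, dt, tau):
    """Exponentially smooth the velocity with timescale tau (one Euler step)."""
    alpha = min(dt / tau, 1.0)
    return tuple((1 - alpha) * s + alpha * c for s, c in zip(v_smooth, v))


def head_direction(v_smooth, previous=(1.0, 0.0)):
    """Normalise the smoothed velocity; keep the previous heading if it is zero."""
    norm = math.hypot(*v_smooth)
    if norm == 0.0:
        return previous
    return (v_smooth[0] / norm, v_smooth[1] / norm)


# Agent initially heading along +x, then moving along +y.
dt, tau = 0.01, 0.1  # tau is a hypothetical smoothing timescale
v_smooth = (1.0, 0.0)
hd = head_direction(v_smooth)
trace = []
for _ in range(100):
    v_smooth = smooth_velocity(v_smooth, (0.0, 1.0), dt, tau)
    hd = head_direction(v_smooth, previous=hd)
    trace.append(hd)
```

      Because of the smoothing, the heading rotates gradually from +x towards +y over a few multiples of `tau` instead of flipping instantly when the velocity changes.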

      In general, we try to tread a careful line. For example, we embrace certain aspects of physical and biological realism (e.g. modelling environments as continuous, or fitting motion to real behaviour) and avoid others (such as the biophysics/biochemistry of individual neurons, or the mechanical complexities of joint/muscle modelling). It is hard to decide where to draw the line, but we have a few guiding principles:

      1. RatInABox is most well suited for normative modelling and neuroAI-style probing questions at the level of behaviour and representations. We consciously avoid unnecessary complexities that do not directly contribute to these domains.

      2. Compute: To best accelerate research we think the package should remain fast and lightweight. Certain features are ignored if computational cost outweighs their benefit.

      3. Users: If, and as, users require complexities e.g. 3D head movements, we will consider adding them to the code base.

      For now we believe proper 3D motion is out of scope for RatInABox. Calculating motion near walls is already surprisingly complex, and doing this in 3D would be challenging. Furthermore, all cell classes would need to be rewritten too. This would be a large undertaking, probably requiring rewriting the package from scratch or making a new package altogether, RatInABox3D (BatInABox?), something which we don’t intend to undertake right now. One option, if users really need 3D trajectory data, is to simulate a 2D Environment (X,Y) and a 1D Environment (Z) independently. With this method (X,Y) and (Z) motion would be entirely independent, which is unrealistic but, depending on the use case, may well be sufficient.
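      The workaround of simulating (X,Y) and (Z) independently and then combining them can be sketched as follows. To keep the example self-contained, a toy random walk (an Ornstein-Uhlenbeck velocity in each dimension, with reflecting walls) stands in for RatInABox's own motion model; the function name and all parameter values are hypothetical.

```python
import math
import random


def toy_walk(n_steps, dt, dims, size, tau=0.7, sigma=0.1, seed=None):
    """Toy random walk in [0, size]^dims with reflecting walls.

    Each velocity component follows an Ornstein-Uhlenbeck process with
    timescale tau and stationary standard deviation sigma.
    """
    rng = random.Random(seed)
    pos = [size / 2] * dims
    vel = [0.0] * dims
    out = []
    for _ in range(n_steps):
        for d in range(dims):
            # Euler-Maruyama update of the OU velocity
            noise = sigma * math.sqrt(2 * dt / tau) * rng.gauss(0.0, 1.0)
            vel[d] += -vel[d] / tau * dt + noise
            pos[d] += vel[d] * dt
            # Reflect off the walls
            if pos[d] < 0:
                pos[d], vel[d] = -pos[d], -vel[d]
            elif pos[d] > size:
                pos[d], vel[d] = 2 * size - pos[d], -vel[d]
        out.append(tuple(pos))
    return out


# Simulate (X, Y) and (Z) independently, then zip them together.
xy = toy_walk(1000, dt=0.1, dims=2, size=1.0, seed=1)
z = toy_walk(1000, dt=0.1, dims=1, size=0.5, seed=2)
xyz = [(x, y, zz) for (x, y), (zz,) in zip(xy, z)]
```

      As the response notes, nothing couples the vertical and horizontal motion here, so whether this is adequate depends on the use case.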

      Alternatively, since, as you say, many animals effectively navigate in 2D but show complex 3D head and other body movements, RatInABox could interface with and feed data downstream to other software packages (for example MuJoCo[11]) which specialise in joint/muscle modelling. This would be a very legitimate use case for RatInABox.

      We’ve flagged all of these assumptions and limitations in a new body of text added to the discussion:

      “Our package is not the first to model neural data[37, 38, 39] or spatial behaviour[40, 41], yet it distinguishes itself by integrating these two aspects within a unified, lightweight framework. The modelling approach employed by RatInABox involves certain assumptions:

      1. It does not engage in the detailed exploration of biophysical[37, 39] or biochemical[38] aspects of neural modelling, nor does it delve into the mechanical intricacies of joint and muscle modelling[40, 41]. While these elements are crucial in specific scenarios, they demand substantial computational resources and become less pertinent in studies focused on higher-level questions about behaviour and neural representations.

      2. A focus of our package is modelling experimental paradigms commonly used to study spatially modulated neural activity and behaviour in rodents. Consequently, environments are currently restricted to being two-dimensional and planar, precluding the exploration of three-dimensional settings. However, in principle, these limitations can be relaxed in the future.

      3. RatInABox avoids the oversimplifications commonly found in discrete modelling, predominant in reinforcement learning[22, 23], which we believe impede its relevance to neuroscience.

      4. Currently, inputs from different sensory modalities, such as vision or olfaction, are not explicitly considered. Instead, sensory input is represented implicitly through efficient allocentric or egocentric representations. If necessary, one could use the RatInABox API in conjunction with a third-party computer graphics engine to circumvent this limitation.

      5. Finally, focus has been given to generating synthetic data from steady-state systems. Hence, by default, agents and neurons do not explicitly include learning, plasticity or adaptation. Nevertheless, we have shown that a minimal set of features, such as parameterised function-approximator neurons and policy control, enables a variety of experience-driven changes in behaviour and cell responses[42, 43] to be modelled within the framework.”

      • What about other environments that are not "Boxes" as in the name - can the environment only be a Box, what about a circular environment? Or Bat flight? This also has implications for the velocity of the agent, etc. What are the parameters for the motion model to simulate a bat, which likely has a higher velocity than a rat?

      Thank you for this question. Since the initial submission of this manuscript RatInABox has been upgraded and environments have become substantially more “general”. Environments can now be of arbitrary shape (including circular), boundaries can be curved, they can contain holes and can also contain objects (0-dimensional points which act as visual cues). A few examples are showcased in the updated figure 1 panel e.

      Whilst we don’t know the exact parameters for bat flight, users could fairly straightforwardly determine these themselves and set them using the motion parameters shown in the table below. We would guess that bats have a higher average speed (speed_mean) and a longer decoherence time due to increased inertia (speed_coherence_time), so the following code might roughly simulate a bat flying around a 10 × 10 m environment. Author response image 1 shows all Agent parameters which can be set to vary the random motion model.
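      As a hedged illustration, the bat parameter choices discussed above could be collected as below. The numerical values are illustrative guesses, not fitted to bat data. The commented-out calls assume that the Environment and Agent constructors accept a `params` dictionary with the keys shown, as in the RatInABox README; check them against the installed version.

```python
# Hypothetical parameter guesses for a bat (illustrative, not fitted to data)
bat_env_params = {
    "scale": 10.0,  # assumed key: side length of a square Environment, in metres
}
bat_agent_params = {
    "speed_mean": 3.0,            # m/s: assumed much faster than a rat
    "speed_coherence_time": 3.0,  # s: assumed longer, reflecting greater inertia
}

# Assuming the RatInABox constructors accept a `params` dict:
#   from ratinabox.Environment import Environment
#   from ratinabox.Agent import Agent
#   Env = Environment(params=bat_env_params)
#   Ag = Agent(Env, params=bat_agent_params)
#   for _ in range(1000):
#       Ag.update()
```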

      Author response image 1.

      • Semi-related, the name suggests limitations: why Rat? Why not Agent? (But its a personal choice)

      We came up with the name “RatInABox” when we developed this software to study hippocampal representations of an artificial rat moving around a closed 2D world (a box). We also fitted the random motion model to open-field exploration data from rats. You’re right that it is not limited to rodents but for better or for worse it’s probably too late for a rebrand!

      • A future extension (or now) could be the ability to interface with common trajectory estimation tools; for example, taking in the (X, Y, (Z), time) outputs of animal pose estimation tools (like DeepLabCut or such) would also allow experimentalists to generate neural synthetic data from other sources of real-behavior.

      This is already possible via our “Agent.import_trajectory()” method. Users can pass an array of timestamps and an array of positions into the Agent class, which will be loaded and smoothly interpolated along, as shown in Fig. 3a and as demonstrated in two new papers[9,10] that used RatInABox by loading in behavioural trajectories.
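      Resampling a low-rate pose-estimation trace onto a simulation timestep, the step that importing a trajectory relies on, can be sketched with plain linear interpolation. This is an assumption for illustration: the response says imported data are "smoothly interpolated", which may use a higher-order scheme than the linear one here, and the function name is hypothetical.

```python
from bisect import bisect_right


def interpolate_trajectory(times, positions, new_times):
    """Linearly interpolate 2D positions sampled at `times` onto `new_times`.

    `times` must be strictly increasing. Requested times outside the
    recorded range are clamped to the first or last sample.
    """
    out = []
    for t in new_times:
        if t <= times[0]:
            out.append(positions[0])
        elif t >= times[-1]:
            out.append(positions[-1])
        else:
            i = bisect_right(times, t) - 1  # times[i] <= t < times[i + 1]
            w = (t - times[i]) / (times[i + 1] - times[i])
            (x0, y0), (x1, y1) = positions[i], positions[i + 1]
            out.append((x0 + w * (x1 - x0), y0 + w * (y1 - y0)))
    return out


# e.g. a 2 Hz tracking trace resampled to a 0.1 s simulation step
times = [0.0, 0.5, 1.0]
positions = [(0.0, 0.0), (0.5, 0.0), (0.5, 0.5)]
fine_times = [k * 0.1 for k in range(11)]
fine = interpolate_trajectory(times, positions, fine_times)
```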

      • What if a place cell is not encoding place but is influenced by reward or encodes a more abstract concept? Should a PlaceCell class inherit from an AbstractPlaceCell class, which could be used for encoding more conceptual spaces? How could their tool support this?

      In fact, PlaceCells already inherit from a more abstract class (Neurons), which contains basic infrastructure for initialisation, saving data, plotting data, etc. We prefer the solution whereby users write their own cell classes which inherit from Neurons (or PlaceCells if they wish). Users then need only write a new get_state() method, which can be as simple or as complicated as they like. Here are two examples we’ve already made, which can be found on GitHub:

      Author response image 2.

      Phase precession: PhasePrecessingPlaceCells(PlaceCells)[12] inherit from PlaceCells and modulate their firing rate by multiplying it by a phase dependent factor causing them to “phase precess”.

      Splitter cells: Perhaps users wish to model PlaceCells that are modulated by the recent history of the Agent, for example which arm of a figure-8 maze it just came down. This is observed in hippocampal “splitter cells”. In this demo[1], SplitterCells(PlaceCells) inherit from PlaceCells and modulate their firing rate according to which arm was last travelled along.
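      The inheritance pattern described above, in which a subclass only overrides `get_state()`, can be sketched with toy classes. `ToyNeurons`, `ToyPlaceCells` and `ToySplitterCells` are hypothetical stand-ins named after the classes in the response; they do not reproduce RatInABox's real constructors, signatures or its splitter-cell demo, and the gain values are arbitrary.

```python
import math


class ToyNeurons:
    """Hypothetical base class: shared bookkeeping; subclasses define get_state()."""

    def __init__(self, n):
        self.n = n
        self.history = []  # firing rates recorded at each update

    def get_state(self, agent_state):
        raise NotImplementedError

    def update(self, agent_state):
        rates = self.get_state(agent_state)
        self.history.append(rates)
        return rates


class ToyPlaceCells(ToyNeurons):
    """Gaussian place fields on a 1D track."""

    def __init__(self, centres, width=0.1):
        super().__init__(len(centres))
        self.centres = centres
        self.width = width

    def get_state(self, agent_state):
        x = agent_state["position"]
        return [math.exp(-((x - c) ** 2) / (2 * self.width**2)) for c in self.centres]


class ToySplitterCells(ToyPlaceCells):
    """Place cells whose rate is scaled by the arm the agent last came down."""

    def __init__(self, centres, preferred_arm, width=0.1, gains=(1.0, 0.2)):
        super().__init__(centres, width)
        self.preferred_arm = preferred_arm
        self.gains = gains  # (gain after the preferred arm, gain otherwise)

    def get_state(self, agent_state):
        rates = super().get_state(agent_state)  # reuse the place-field code
        on_preferred = agent_state["last_arm"] == self.preferred_arm
        g = self.gains[0] if on_preferred else self.gains[1]
        return [g * r for r in rates]


cells = ToySplitterCells(centres=[0.5], preferred_arm="left")
after_left = cells.update({"position": 0.5, "last_arm": "left"})
after_right = cells.update({"position": 0.5, "last_arm": "right"})
```

      The subclass inherits update and history handling from the base class and adds only the history-dependent gain, which is the division of labour the response describes.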

      • This is a bit odd in the Discussion: "If there is a small contribution you would like to make, please open a pull request. If there is a larger contribution you are considering, please contact the corresponding author3" This should be left to the repo contribution guide, which ideally shows people how to contribute and your expectations (code formatting guide, how to use git, etc.). Also, this can be very off-putting to new contributors: what is small? What is big? We suggest using more inclusive language.

      We’ve removed this line and left it to the GitHub repository to describe how contributions can be made.

      • Could you expand on the run time for BoundaryVectorCells, namely, for how long of an exploration period? We found it was on the order of 1 min to simulate 30 min of exploration (which is of course fast, but mentioning relative times would be useful).

      Absolutely. How long it takes to simulate BoundaryVectorCells will depend on the discretisation timestep and how many neurons you simulate. Assuming you used the default values (dt = 0.1, n = 10) then the motion model should dominate compute time. This is evident from our analysis in Figure 3f which shows that the update time for n = 100 BVCs is on par with the update time for the random motion model, therefore for only n = 10 BVCs, the motion model should dominate compute time.

      So how long should this take? Fig. 3f shows the motion model takes ~10^-3 s per update. One hour of simulation at dt = 0.1 s requires 3600/dt = 36,000 updates, which would therefore take about 36,000 × 10^-3 s = 36 seconds. So your estimate of 1 minute seems to be in the right ballpark and consistent with the data we show in the paper.
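The back-of-envelope estimate in the previous paragraph can be written out as a short calculation. This is a sketch under the stated assumptions: a per-update cost of ~10^-3 s for the random motion model (read from Fig. 3f), dt = 0.1 s, and cell-update costs ignored because the motion model dominates for n = 10 BoundaryVectorCells. The function name is our own and is not part of RatInABox.

```python
def estimate_runtime(duration_s, dt=0.1, seconds_per_update=1e-3):
    """Rough runtime estimate: number of timesteps times per-update cost.

    seconds_per_update is an assumed motion-model cost (~1e-3 s, Fig. 3f);
    cell-update costs are ignored.
    """
    n_updates = round(duration_s / dt)  # round() guards against float error in 3600 / 0.1
    return n_updates, n_updates * seconds_per_update


updates_1h, runtime_1h = estimate_runtime(3600)        # one hour of exploration
updates_30min, runtime_30min = estimate_runtime(1800)  # the reviewer's 30 minutes
```

Under the same assumptions, a 30-minute session comes out at roughly 18 s of compute, so an observed ~1 minute is within a small factor of this rough estimate.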

      Interestingly, this corroborates the results in a new inset panel where we calculated the total time for cell and motion model updates for a PlaceCell population of increasing size (from n = 10 to 1,000,000 cells). It shows that the motion model dominates compute time up to approximately n = 1000 PlaceCells (for BoundaryVectorCells it’s probably closer to n = 100), beyond which cell updates dominate and the time scales linearly.

      These are useful and non-trivial insights as they tell us that the RatInABox neuron models are quite efficient relative to the RatInABox random motion model (something we hope to optimise further down the line). We’ve added the following sentence to the results:

      “Our testing (Fig. 3f, inset) reveals that the combined time for updating the motion model and a population of PlaceCells scales sublinearly (approximately O(1)) for small populations (n < 1000), where updating the random motion model dominates compute time, and linearly for large populations (n > 1000). PlaceCells, BoundaryVectorCells and the Agent motion model update times will be additionally affected by the number of walls/barriers in the Environment. 1D simulations are significantly quicker than 2D simulations due to the reduced computational load of the 1D geometry.”
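The two regimes described in this sentence can be illustrated with a simple linear cost model: a fixed motion-model cost per timestep plus a cost proportional to the number of cells. Both constants are assumptions: 10^-3 s for the motion model (Fig. 3f) and a hypothetical 10^-6 s per PlaceCell, chosen only so that the crossover lands near the n ≈ 1000 reported in the text. These are not measured RatInABox timings.

```python
def total_update_time(n_cells, t_motion=1e-3, t_per_cell=1e-6):
    """Hypothetical linear cost model for one timestep (seconds):
    a fixed motion-model cost plus a per-cell cost."""
    return t_motion + n_cells * t_per_cell


def crossover_population(t_motion=1e-3, t_per_cell=1e-6):
    """Population size at which cell updates cost as much as the motion model."""
    return t_motion / t_per_cell


n_star = crossover_population()                      # ~1000 cells with these constants
small_pop = total_update_time(10)                    # ~1e-3 s: motion model dominates
scaling = total_update_time(2_000_000) / total_update_time(1_000_000)  # ~2: linear regime
```

Below the crossover, adding cells barely changes the total (approximately constant time); well above it, doubling the population roughly doubles the cost.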

      And this sentence to section 2:

      “RatInABox is fundamentally continuous in space and time. Position and velocity are never discretised but are instead stored as continuous values and used to determine cell activity online, as exploration occurs. This differs from other models which are either discrete (e.g. “gridworld” or Markov decision processes) or approximate continuous rate maps using a cached list of rates precalculated on a discretised grid of locations. Modelling time and space continuously more accurately reflects real-world physics, making simulations smooth and amenable to fast or dynamic neural processes which are not well accommodated by discretised motion simulators. Despite this, RatInABox is still fast: simulating 100 PlaceCells for 10 minutes of random 2D motion (dt = 0.1 s) takes about 2 seconds on a consumer-grade laptop CPU (or 7 seconds for BoundaryVectorCells).”
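The difference between evaluating a rate online at a continuous position and looking it up in a precalculated grid can be sketched as follows. The Gaussian place field (centre (0.5, 0.5) m, width 0.1 m) and the 5 cm cache grid are hypothetical values chosen for illustration, not RatInABox defaults or internals.

```python
import numpy as np


def place_rate(pos, centre=(0.5, 0.5), width=0.1):
    """Gaussian place-field rate evaluated exactly at a continuous position."""
    d2 = np.sum((np.asarray(pos, float) - np.asarray(centre, float)) ** 2)
    return float(np.exp(-d2 / (2 * width ** 2)))


# Alternative approach: precompute rates on a coarse grid, then look up the nearest bin.
grid_res = 0.05                              # hypothetical bin size (m)
xs = np.arange(0.0, 1.0 + 1e-9, grid_res)    # 21 grid coordinates from 0 to 1 m
cached = np.array([[place_rate((x, y)) for y in xs] for x in xs])


def cached_rate(pos):
    """Nearest-bin lookup in the precalculated rate map."""
    i = int(round(pos[0] / grid_res))
    j = int(round(pos[1] / grid_res))
    return float(cached[i, j])


pos = (0.512, 0.537)      # an arbitrary continuous position
exact = place_rate(pos)   # evaluated online at the true position
approx = cached_rate(pos) # snapped to grid point (0.5, 0.55)
```

Here the cached value differs from the exact one by a few percent of the peak rate; the error grows with coarser grids and sharper fields, whereas online evaluation has no such discretisation error.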

      Whilst this would be very interesting it would likely represent quite a significant edit, requiring rewriting of almost all the geometry-handling code. We’re happy to consider changes like these according to (i) how simple they will be to implement, (ii) how disruptive they will be to the existing API, (iii) how many users would benefit from the change. If many users of the package request this we will consider ways to support it.

      • In general, the set of default parameters might want to be included in the main text (vs in the supplement).

      We also considered this but decided to leave them in the methods for now. The exact values of these parameters are subject to change in future versions of the software. Also, we’d prefer for the main text to provide a low-detail, high-level description of the software and for the methods to provide a place for keen readers to dive into the mathematical and coding specifics.

      • It still says you can only simulate 4 velocity or head directions, which might be limiting.

      Thanks for catching this. This constraint has been relaxed. Users can now simulate an arbitrary number of head direction cells with arbitrary tuning directions and tuning widths. The methods have been adjusted to reflect this (see section 6.3.4).

      • The code license should be mentioned in the Methods.

      We have added the following section to the methods:

      6.6 License

      RatInABox is currently distributed under an MIT License, meaning users are permitted to use, copy, modify, merge, publish, distribute, sublicense and sell copies of the software.

    1. Author Response

      Reviewer #2 (Public Review):

      “The authors wish to relate beat-to-beat coordination of cardiac function (in this case as measured left ventricular pressure) to the activity of sympathetic neuron spiking within the stellate ganglion. A strength includes the challenging measurements from multiple stellate neuron activity over long durations in situ in the anesthetized pig.”

      We thank the reviewer for their feedback.

      “A major and overriding weakness is the founding assumption of the analysis that the underlying sympathetic neurons are all cardiac functioning in nature - an assumption that is overwhelmingly unlikely given the evidence in other species including humans that stellate postganglionic neurons are functionally mixed and have functional noncardiac targets. The use of broad and poorly explained/defined terms such as "event entropy" is difficult to follow and find meaning from. The manuscript is filled with difficult-to-follow text like "The neural specificity metric (Sudarshan et al., 2021). Fig. 5", is used to evaluate the degree to which neural activity is biased toward control target states taken here as LVP" and "The neural specificity is reduced from a multivariate signal to a univariate signal by computing the Shannon entropy at each timestamp of the mapped neural specificity metric". The figures are difficult to understand with axes that often bear no units or are quite compressed obscuring the intuitive meaning of the data trends. Fundamentally, cardiac pressure cycles with each heartbeat - roughly once per second - yet fluctuations in the depicted mean spike rate data with changes perhaps ten times in 25 minutes. Such plots are disorienting and difficult to associate with cardiac or neuron "functioning". Only 17 of the 38 references are not self-citations and thus the cited literature represents a narrow view of sympathetic regulation and sympathetic/stellate ganglion knowledge. Much of the foundations are self-professed in earlier publications by the present group and assumed to be accepted.”

      “Fundamentally, cardiac pressure cycles with each heartbeat - roughly once per second - yet fluctuations in the depicted mean spike rate data with changes perhaps ten times in 25 minutes. Such plots are disorienting and difficult to associate with cardiac or neuron "functioning”

      We would like to clarify this point with the understanding that the reviewer is referring to the time axis in Figure 3C in the manuscript.

      The coactivity matrix constructed in Figure 3C computes the cross correlation in sliding mean/std spike activities for different pairs of channels. The mean spiking activities across channels, as the reviewer correctly pointed out, do indeed have a weak autocorrelation with the period of the heart rate. This weak correlation at the heart-rate period, possibly due to slow firing rates, was seen across all channels of both control and HF animals. However, the cause of the high coactivity exhibited by a large proportion of channel pairs, termed cofluctuation (shown as red tracings in Fig 3D), is not known and cannot be directly associated with cardiac functioning.
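A minimal sketch of the kind of coactivity computation described above, assuming binned spike counts per channel, a sliding mean with an arbitrary window, and zero-lag Pearson correlation as a simplification of the full cross correlation. The simulated data, window length and function names are hypothetical and do not reproduce the authors' pipeline; a sliding-STD variant would replace the mean filter with a rolling standard deviation.

```python
import numpy as np


def sliding_mean(spike_counts, window):
    """Sliding-window mean of binned spike counts (rows = channels)."""
    kernel = np.ones(window) / window
    return np.array([np.convolve(ch, kernel, mode="valid") for ch in spike_counts])


def coactivity_matrix(spike_counts, window=10):
    """Zero-lag Pearson correlation between channels' sliding-mean activity."""
    return np.corrcoef(sliding_mean(spike_counts, window))


# Simulated example: channels 0 and 1 share a common drive, channel 2 is independent.
rng = np.random.default_rng(0)
n_bins = 2000
shared = rng.poisson(2.0, size=n_bins)
counts = np.vstack([
    shared + rng.poisson(1.0, size=n_bins),
    shared + rng.poisson(1.0, size=n_bins),
    rng.poisson(3.0, size=n_bins),
])
C = coactivity_matrix(counts, window=10)  # 3 x 3 coactivity matrix
```

Channel pairs with shared drive show high entries in C, which is the sense in which a coactivity matrix can flag cofluctuating pairs without any reference to the cardiac cycle.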

      The cofluctuation was also found to be aperiodic in nature, approximating a lognormal distribution (Fig R1), with the HF animals showing heavy tails outside their confidence intervals (Fig R1C-D). The event rate computed from the cofluctuation time series (shown as blue steps in Fig 3E) for an animal is a measure of spatial coherence among SG neural populations and was developed as a novel metric to be used in future studies.

      Figure R1: Cofluctuation histograms (calculated from the mean or standard deviation of the sliding spike rate, referred to as Cofluctuation_MEAN and Cofluctuation_STD, respectively) and log-normal fits for each animal group. μ_FIT and σ_FIT are the respective mean and standard deviation (STD) of the fitted distribution, used for the 68% confidence interval bounds. A-B: Control animals have narrower bounds and represent a better fit to the log-normal distribution. C-D: Heart failure (HF) animals display more heavily skewed distributions that indicate heavy tails.
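The log-normal fit and 68% bounds in Figure R1 can be sketched as follows, assuming the fit is obtained from the mean (μ_FIT) and standard deviation (σ_FIT) of the log-transformed cofluctuation values and that the bounds are exp(μ_FIT ± σ_FIT). Function names and the simulated sample are hypothetical. For data that are genuinely log-normal, about 32% of values fall outside these bounds, which gives a reference point against which observed tails can be compared.

```python
import numpy as np


def lognormal_bounds(values):
    """Fit a log-normal via mean/SD of log(values); return the fit
    parameters and the +/-1 SD (68%) bounds on the original scale."""
    logs = np.log(np.asarray(values, dtype=float))
    mu_fit, sigma_fit = logs.mean(), logs.std(ddof=1)
    return mu_fit, sigma_fit, np.exp(mu_fit - sigma_fit), np.exp(mu_fit + sigma_fit)


def fraction_outside(values, lower, upper):
    """Fraction of observations lying outside [lower, upper]."""
    v = np.asarray(values, dtype=float)
    return float(np.mean((v < lower) | (v > upper)))


# Simulated, genuinely log-normal sample (illustrative only).
rng = np.random.default_rng(1)
sample = rng.lognormal(mean=0.0, sigma=0.5, size=20000)
mu, sigma, lo, hi = lognormal_bounds(sample)
frac = fraction_outside(sample, lo, hi)  # close to 0.317 for a log-normal sample
```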

      “Only 17 of the 38 references are not self-citations and thus the cited literature represents a narrow view of sympathetic regulation and sympathetic/stellate ganglion knowledge. Much of the foundations are self-professed in earlier publications by the present group and assumed to be accepted.”

      We thank the reviewer for pointing this out. We have added four additional citations from the neuroscience literature that cover methods such as neural population bias and the linkage of spatiotemporal dynamics to control targets. These citations appear on page 15 in the “Conclusion” section of the manuscript. In addition, carrying out these cardiac nervous system experiments is our group’s specialty; we are not aware of another group collecting multi-electrode array data from the cardiac nervous system and studying the population dynamics of cardiac neurons. Hence, we build on our previous work. The most relevant literature (not necessarily related to the cardiac nervous system) can be found in the neuroscience references we cited, which apply neural population recordings to different brain areas, mainly in the neuropsychiatry domain, to understand disease dynamics.

      “For the expert or even the uninformed reader, this report is broadly confused and confusing. The premises (beat to beat or whether LVP conveys cardiac function) are poorly supported. The conclusions are quite vague.”

      Thank you for your feedback. To simplify the presentation, we moved all mathematical details to the supplementary material, rewrote the abstract and the conclusion from scratch, and split the methods figures that may have been confusing. We believe that our novel metrics, event rate and entropy, capture non-trivial linkages between heart failure status, cardiac neural activity (spike activity), and peripheral activity (LVP). We have supported our metrics with data from 17 animals obtained with state-of-the-art surgical techniques and technology, and reported our results with detailed statistical analyses. Our manuscript essentially highlights that the event rate and entropy metrics differ significantly between control animals and animals with heart failure. These metrics can be used to design future studies with these animal models to provide a more quantitative approach to heart disease, rather than binary (yes or no) descriptions.

      “Discussion: The abstract does not convey conclusions from the findings and contains broad statements such as "signatures based on linking neuronal population cofluctuation and examine differences in "neural specificity" of SG network" that have little substantive value or conclusion for the reader. Fundamentally what does the title "signatures based on linking neuronal population" cofluctuation mean to the reader? What changed in HF?”

      Thank you for this comment. We completely revised the abstract and conclusion as detailed in our response to Essential Revision #1. Event rate is a metric related to neural activity recordings and entropy is related to the association of neural activity to left ventricular blood pressure. Our findings suggest that both the neural population activity itself (event rate) and its ability to pay attention to cycles of left ventricular pressure (neural specificity) are significantly higher in animals with HF compared to controls.
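The entropy step quoted by the reviewer ("computing the Shannon entropy at each timestamp of the mapped neural specificity metric") can be illustrated with a small sketch. It assumes that, at each timestamp, the specificity metric provides non-negative weights over a set of discretised LVP states; how the metric itself is constructed (Sudarshan et al., 2021) is not reproduced here, and the function name and example weights are hypothetical.

```python
import numpy as np


def entropy_per_timestamp(specificity, eps=1e-12):
    """Shannon entropy (bits) of each row after normalising rows to sum to 1.

    specificity: array of shape (n_timestamps, n_states) with non-negative
    weights over hypothetical discretised LVP states.
    """
    p = np.asarray(specificity, dtype=float)
    p = p / p.sum(axis=1, keepdims=True)
    # eps avoids log2(0); zero-probability states then contribute ~0.
    return -np.sum(p * np.log2(p + eps), axis=1)


spec = np.array([
    [1.0, 0.0, 0.0, 0.0],      # activity concentrated on a single LVP state
    [0.25, 0.25, 0.25, 0.25],  # activity spread evenly over four states
])
H = entropy_per_timestamp(spec)  # about [0, 2] bits
```

The reduction turns a multivariate signal (weights over states) into one number per timestamp: low entropy means activity is concentrated on few LVP states, high entropy means it is spread across many.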

    1. Author Response

      Reviewer #1 (Public Review):

      (1.1) The work by Porciello and colleagues provides scientific evidence that the acidic content of the stomach covaries with the experienced level of disgust and fear evoked by disgusting videos. The workings of the gut during cognitive or emotional processes have remained elusive due to the invasiveness of the methods needed to study them. The major strength of the paper is the use of non-invasive smart pill technology, which senses changes in pH, pressure and temperature as it travels through the gut, allowing the authors to investigate how different emotions induced with validated video clips modulate the state of the gut. The experimental paradigm used to evoke distinct emotions was also successful, as participants reported the expected emotions after each emotion block. While the reported evidence is correlational in nature, I believe these results open up new avenues for studying brain-body interactions during emotions in cognitive neuroscience, and future causal manipulations will shed more light on this phenomenon. Indeed, this is the first study to provide evidence for a link between gastric acidity and emotional experience beyond single-patient studies, and it has major implications for advancing our understanding of disorders with psychosomatic influences, such as stress and its influence on gastritis.

      1.1 First of all, we want to thank Reviewer#1 for his cogent comments and for highlighting that our findings may inspire future research on brain-body interactions. We carefully considered all the remarks and changed the manuscript accordingly.

      (1.2) As for the limitations, little insight is provided on the mechanisms, time scales, and inter-individual variability of the link between gastric pH and emotional induction. Since this is a novel phenomenon, it would be important to further validate and characterize this finding. Along these lines, one of the best-known influences of disgust on the gut is tachygastria, the acceleration of the gastric rhythm. It would be important to understand how acid secretion evoked by disgusting films is related to tachygastria, but the authors only examine the influence of disgusting films on the normogastric frequency range.

      1.2 We are aware that at the moment our data are mainly descriptive and do not provide a clear picture of the causal mechanisms. However, to address this outstanding issue, we added a new series of analyses.

      Most of the data on gastric activity come from analyses of the normogastric band. However, information about the tachygastric EGG rhythm in humans is of potentially great importance. To address the reviewer’s comment, and considering the previously published literature, we re-examined the EGG data focusing on the tachygastric rhythm. The methodology was the same as that used for normogastric peak extraction, but this time we extracted the peak in the tachygastric band, specifically 0.067 to 0.167 Hz (i.e., 4–10 cpm). The ANOVA performed on the tachygastric cycle revealed a significant main effect of the type of video clip (F(4, 112) = 2.907, p = 0.025, Eta2 (partial) = 0.09). However, the Bonferroni-corrected post hoc tests did not show any significant difference between the different types of emotional video clips and the neutral condition. The only significant comparison was between happy and fearful video clips, indicating that participants’ tachygastric cycles were faster when exposed to happy rather than fearful video clips (p = 0.038). For a visual representation of the outcomes, please see Fig S6.
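A minimal sketch of band-limited peak extraction, assuming a Hann-windowed periodogram computed with NumPy's FFT. The 0.067-0.167 Hz tachygastric band is taken from the text; the 0.033-0.067 Hz (2-4 cpm) normogastric band, the 2 Hz sampling rate and the synthetic signal are assumptions for illustration, and the authors' actual spectral method (e.g. Welch averaging, filtering) may differ.

```python
import numpy as np


def band_peak_frequency(signal, fs, f_lo, f_hi):
    """Frequency (Hz) of the largest periodogram peak within [f_lo, f_hi]."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                                  # remove DC offset
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    power = np.abs(np.fft.rfft(x * np.hanning(x.size))) ** 2
    in_band = (freqs >= f_lo) & (freqs <= f_hi)
    return float(freqs[in_band][np.argmax(power[in_band])])


# Synthetic 10-minute "EGG" with a 3 cpm (0.05 Hz) and a weaker 6 cpm (0.1 Hz) component.
fs = 2.0                                   # hypothetical sampling rate (Hz)
t = np.arange(0, 600, 1 / fs)
rng = np.random.default_rng(2)
egg = (np.sin(2 * np.pi * 0.05 * t)
       + 0.5 * np.sin(2 * np.pi * 0.1 * t)
       + 0.1 * rng.standard_normal(t.size))

normo_peak = band_peak_frequency(egg, fs, 0.033, 0.067)  # assumed 2-4 cpm band
tachy_peak = band_peak_frequency(egg, fs, 0.067, 0.167)  # 4-10 cpm band from the text
```

Multiplying a peak in Hz by 60 converts it to cycles per minute (here 3 and 6 cpm). With 10 minutes of data the frequency resolution is 1/600 Hz, which limits how finely peak frequencies can be compared between conditions.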

      We revised the main text (Page 17, lines: 472-482) to include this analysis. The revised text now reads as follows:

      “Finally, we explored whether the normogastric and/or tachygastric cycle changed in response to specific emotional experiences. After checking that normogastric and tachygastric peak frequencies were normally distributed (all ps > 0.05), we ran two separate ANOVAs on the individual peak frequencies in the normogastric and tachygastric range. Each analysis had the type of video clip as within-subjects factor. The ANOVA performed on the normogastric rhythm was not significant (F(4, 44) = 1.037, p = 0.399), suggesting that the gastric rhythm did not change while participants observed the different emotional video clips. In contrast, the ANOVA performed on the tachygastric rhythm did show a significant main effect (F(4, 112) = 2.907, p = 0.025, Eta2 (partial) = 0.09). However, the only comparison that survived the Bonferroni correction was the one between happy and fearful video clips, namely participants’ tachygastric cycle was faster when they observed happy vs fearful video clips (p = 0.038); see Fig. S6 for a graphical representation of the results.”

      To address the Reviewer’s comment, we also correlated the average pH value with the corresponding frequency of the tachygastric cycle recorded during the disgusting, happy and fearful video clips, namely the emotions associated with changes in pH. The only significant correlation was found during the disgusting video clips (r = 0.435; p = 0.023; all other rs ≤ 0.351, all other ps ≥ 0.073). Contrary to our expectations, we found a positive correlation, suggesting that when participants were exposed to disgusting video clips, the less acidic the pH, the higher the frequency of the tachygastric cycle. Instead, we know from our pill data that disgusting video clips are associated with more acidic values, and from the literature (not replicated by us) with a faster gastric rhythm. Since the EGG analysis did not provide strong support for a relationship between the gastric rhythm and the emotional experience, we believe that additional evidence is needed to clarify the relationship between pH and gastric rhythm.

      (1.3) Additionally, only one channel of the electrogastrogram (EGG) was used to measure the gastric rhythm, and no information is provided on the quality of the recordings. With only one channel of EGG, it is often impossible to identify the gastric rhythm as the position of the stomach varies from person to person, yielding inaccurate estimates of the frequency of the gastric rhythm.

      1.3 We agree with Reviewer 1 on this point and acknowledge the potential limitation associated with the one-channel EGG recording in our study. To address this remark, in a separate (ongoing) study (N = 25 participants), we recorded the electrogastrogram following the methodology outlined by Wolpert et al., 2020, published in Psychophysiology. Thus, to study the EGG in association with the emotional experience, we used a bipolar 4-channel montage while participants observed the same emotional video clips used in our current study (see picture below for the montage set-up).

      Author response image 1 shows the 4-channels EGG bipolar recording montage reproducing the one proposed by Wolpert et al., 2020.

      Author response image 1.

      Then, we extracted the gastric cycle in both the normogastric and the tachygastric bands.

      After checking that data were normally distributed (Kolmogorov-Smirnov ds > 0.10; ps> .20), in the case of the gastric cycle extracted in the normogastric band, we ran a repeated measures ANOVA with the type of video clip as the only within-subjects factor measured on the 5 levels (i.e. the five types of video clip: Disgusting, Fearful, Happy, Neutral, and Sad). The ANOVA shows that the gastric cycle recorded during the different video clips did not differ (F (4,96) = 0.39; p= 0.81), see the plot on Author response image 2.
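The degrees of freedom reported above follow from the repeated-measures design: with n = 25 participants and k = 5 video-clip types, the effect has k - 1 = 4 and the error term (k - 1)(n - 1) = 96 degrees of freedom. The sketch below computes an uncorrected one-way repeated-measures F in NumPy. No sphericity correction is applied (an assumption, since the text does not say whether one was used), and the simulated data are purely illustrative.

```python
import numpy as np


def rm_anova_oneway(data):
    """One-way repeated-measures ANOVA (no sphericity correction).

    data: array of shape (n_subjects, k_conditions).
    Returns (F, df_effect, df_error).
    """
    y = np.asarray(data, dtype=float)
    n, k = y.shape
    grand = y.mean()
    ss_cond = n * np.sum((y.mean(axis=0) - grand) ** 2)   # between-condition
    ss_subj = k * np.sum((y.mean(axis=1) - grand) ** 2)   # between-subject, removed from error
    ss_total = np.sum((y - grand) ** 2)
    ss_error = ss_total - ss_cond - ss_subj
    df_cond, df_error = k - 1, (k - 1) * (n - 1)
    F = (ss_cond / df_cond) / (ss_error / df_error)
    return F, df_cond, df_error


# Simulated example: 25 participants x 5 clip types.
rng = np.random.default_rng(3)
subject_offsets = rng.normal(0.0, 2.0, size=(25, 1))          # between-subject variability
null_data = subject_offsets + rng.normal(0.0, 1.0, size=(25, 5))
effect_data = null_data + np.array([0.0, 0.0, 0.0, 0.0, 3.0])  # one condition shifted
F_null, df1, df2 = rm_anova_oneway(null_data)
F_effect, _, _ = rm_anova_oneway(effect_data)
```

Because each participant's mean is removed from the error term, large between-subject differences (here the offsets) do not inflate the denominator; only within-subject condition effects drive F.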

      Author response image 2.

      Gastric cycle (normogastric band) recorded via multiple-channels electrogastrogram (EGG) during the emotional experience. The plot shows the gastric cycle extracted in the normogastric band while participants were observing the five categories of the video clips (i.e. those inducing disgust, fear, happiness, sadness and, as control, a neutral state).

      We also extracted the gastric cycle in the tachygastric band. The distribution of the data was not normal in one condition (Kolmogorov-Smirnov ds > 0.27; p < 0.05), therefore we ran a Friedman ANOVA to compare the gastric cycle during the different emotional experiences. The Friedman ANOVA was not statistically significant (χ2 (4) = 2.88; p = 0.58), suggesting that, similarly to the gastric cycle extracted in the normogastric band, the one extracted in the tachygastric band was also not clearly associated with the investigated emotional states; see Author response image 3.
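For readers less familiar with the Friedman test, the statistic can be sketched in a few lines: values are ranked within each participant, rank sums are compared across conditions, and the result is referred to a chi-square distribution with k - 1 = 4 degrees of freedom (for example via scipy.stats.chi2.sf). This sketch assumes no tied values within a participant (no tie correction is applied), and the example data are illustrative.

```python
import numpy as np


def friedman_chi2(data):
    """Friedman test statistic for an (n_subjects, k_conditions) array.

    Ranks are computed within each subject; ties are assumed absent,
    so no tie correction is applied. Returns (chi2, df).
    """
    y = np.asarray(data, dtype=float)
    n, k = y.shape
    ranks = np.argsort(np.argsort(y, axis=1), axis=1) + 1  # ranks 1..k within each row
    rank_sums = ranks.sum(axis=0)
    chi2 = 12.0 / (n * k * (k + 1)) * np.sum(rank_sums ** 2) - 3 * n * (k + 1)
    return float(chi2), k - 1


# Extreme case: all 25 participants order the 5 conditions identically.
ordered = np.tile(np.arange(1.0, 6.0), (25, 1))
chi2_max, df = friedman_chi2(ordered)  # maximum possible value, n * (k - 1) = 100
```

The maximum value n(k - 1) = 100 for 25 participants and 5 conditions gives a sense of scale for the reported χ2(4) = 2.88, which indicates little consistency in how participants' tachygastric frequencies ranked across clip types.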

      Author response image 3.

      Gastric cycle (tachygastric band) recorded via multiple-channels electrogastrogram (EGG) during the emotional experience. The plot above shows the gastric cycle extracted in the tachygastric band while participants were observing the five categories of the video clips (i.e. those inducing disgust, fear, happiness, sadness and, as control, a neutral state).

      Results from this control study suggest that the non-significant effect on the gastric cycle was probably not due to our use of a one-channel EGG montage, at least as far as the gastric cycle extracted from the normogastric band is concerned.

      As for the tachygastric frequency associated with the emotional experience, these results from a multi-channel EGG recording seem to go in the same direction as the normogastric ones, namely no frequency of the gastric cycle recorded during the emotional video clips differed from the control condition.

      The only significant difference that we found in our 1-channel EGG study was the one between the happy and the fearful video clips (see Fig. S6 in the supplementary materials and above). Specifically, we found that happy video clips were associated with a higher gastric frequency than fearful ones. However, we did not replicate this finding in our multi-channel EGG study.

      Although suggestive, this evidence is not conclusive. Indeed, we are aware that a final conclusion about the results of our multi-channel study can be drawn only once a larger sample has been obtained.

      (1.4) Finally, I believe that the results do not show evidence in favor of the discrete nature of emotions theory as they claim in the discussion. Authors chose to use stimuli inducing discrete emotions, and only asked subjective reports of these same discrete emotions, so these results shed no light on whether emotions are represented discretely vs continuously in the brain.

      We revised the discussion in order to better describe our results and toned down the interpretation that the present findings directly support the discrete nature of emotions, as suggested by this Reviewer.

      The revised text on pages 21-22 (lines 622-631) now reads as follows:

      “Overall, and in line with theoretical and empirical evidence (Damasio, 1999; Harrison et al., 2010; James, 1994; Lettieri et al., 2019; Stephens et al., 2010), our findings may suggest that specific patterns of subjective, behavioural, and physiological measures are linked to unique emotional states...We acknowledge that our results, although novel, are restricted to a sample of male participants, and more importantly they need to be replicated. We also acknowledge that future studies should better investigate the mechanisms underlying the role of the pH in the emergence of specific emotions. For instance, pharmacologically manipulating stomach pH during emotional induction, not only for basic emotions but also for exploring complex emotions such as moral disgust (Rozin et al., 2009), would enable researchers to generalize these findings and examine the directionality of this relationship.”

      Reviewer #2 (Public Review):

      To measure the role of gastric state in emotion, the authors used an ingestible smart pill to measure pH, pressure, and temperature in the gastrointestinal tract (stomach, small bowel, large bowel) while participants watched videos that induced disgust, fear, happiness, sadness, or a control (neutral). The study has a number of strengths, including the novelty of the measurement (very few studies have ever measured these gut properties during emotion processing) and the apparent robustness of their main finding (that during disgusting video clips, participants who experienced more feelings of disgust (and to a lesser degree which might not survive more stringent multiple comparison correction, fear) had more acidic stomach measurements, while participants who experienced more happiness during the disgusting video clips had a less acidic (more basic) stomach pH. Although the study is correlational (which all discussion should carefully reflect) and is restricted to a moderately-sized, homogenous sample, the results support their general conclusion that stomach pH is related to emotion experience during disgust induction. There may be additional analyses to conduct in order for the authors to claim this effect is specific to the stomach. Nevertheless, this work is likely to have a large impact on the field, which currently tends to rely on noninvasive measures of gastric activity such as electrogastrography (which the authors also collect for comparison); the authors' minimally-invasive approach yields new and useful measurements of gastric state. These new measures could have relevance beyond emotion processing in understanding the role of gut pH (and perhaps temperature and pressure) in cognitive processes (e.g. interoception) as well as mental and physical health.

      We are very grateful to Reviewer#2 for skilfully managing the paper and highlighting its strengths, particularly the innovative measurement approach and the potential implications these findings might offer for future research into the impact of gastric signals on emotional experiences and potentially on many other higher-order cognitive functions. Additionally, we would like to thank her for the highly valuable feedback. We have incorporated all the comments into the revised manuscript, aiming to enhance its quality.

      Reviewer #3 (Public Review):

      This study used novel ingestible pills to measure pH and other gastric signals, and related these measures to self-report ratings of emotions induced by video clips. The main finding was that when participants viewed videos of disgust, there was an association between gastric pH and feelings of disgust and fear, and (in the opposite direction) happiness. These findings may be the first to relate objective measures of gastric physiology to emotional experience. The methods open up many new questions that can be addressed by future studies and are thus likely to have an impact on the field.

      We also thank Reviewer#3 very much for the careful reading of our manuscript, for highlighting the strengths of our study, and for providing valuable feedback. Below is a point-by-point response to all the comments raised by this Reviewer. We have incorporated their comments, and we hope they are satisfied with the new version of the manuscript.

      (3.1) My main concern is with the reliability of the results. The study associates many measures (pH, temperature, pressure, EGG) in stomach, small bowel, and large bowel with multiple emotion ratings. This amounts to many statistical tests. Only one of these measures (pH in the stomach) shows a significant effect. Furthermore, the key findings, as displayed in Figure 4, do not look particularly convincing. Perhaps this is a display issue, but the relations between stomach pH and VAS ratings of disgust, fear, and happiness were not apparent from the scatter plot and may be influenced by outliers (e.g., happiness).

      3.1 We thank Reviewer#3 for raising this issue, which was also raised by Reviewers #1 and #2; see replies above. As reported above, we worked on the data analysis in order to provide more evidence supporting our claim, i.e. that pH plays a role in the emotional experience of disgust, happiness and fear. We modified Figure 4 (now 5), as also requested by Reviewers 1 and 2, and we hope that it is now clearer. We included a new analysis in which we used all the datapoints recorded from the ingestible device and performed a mixed-model analysis with pH as the dependent variable, type of video clip and number of datapoints (‘Time’) as fixed factors, and by-subject intercepts as random effects. This analysis not only supported the results of the original one but provided evidence for a causal role of the emotional induction on the pH of the stomach. Results of this analysis are described in point 1.7 in the response to Reviewer#1, and the results of the new analysis and the revised version of the main figure can be found in tracked changes in the manuscript (Pages 15-16, lines: 408-439) and are copied below.

      “To explore how the emotional induction could modulate the pH of the stomach and how the length of the exposure to that specific emotional induction could also play a role in modulating pH variations, we ran an additional model, Model 2. This model included all the pH datapoints registered using the Smartpill as dependent variable, the type of video clip and the number of the datapoints (“Time”) as fixed effects, and the by-subject intercepts as random effects (see Supplementary information for a detailed description of the model). Model 2 had a marginal R2 = 0.014 and a conditional R2 = 0.79. Although visual inspection of the plots revealed some small deviations from homoscedasticity, visual inspection of the residuals did not show important deviations from normality. As for collinearity (tested by means of the vif function of the car package), all independent variables had (GVIF^(1/(2*Df)))^2 < 10.

      Type III analysis of variance of Model 2 showed a statistically significant main effect of Time (F = 20.237, p < 0.001, Eta2 < 0.01), suggesting that, independently of the type of video clip observed, the stomach pH significantly decreased as a function of the time of exposure to the induction. A significant main effect of the type of video clip was also found (F = 22.242, p < 0.001, Eta2 = 0.01), suggesting that the pH of the stomach changes when participants experience different types of emotions. In particular, post hoc analysis revealed that pH was more acidic when participants observed disgusting compared to fearful (t= -11.417; p < 0.001), happy (t= -15.510; p < 0.001) and neutral (t= -3.598; p = 0.003) video clips.

      Also, pH was more acidic when participants observed fearful compared to happy (t= -4.064; p < 0.001), and less acidic compared to neutral (t= 7.835; p < 0.001) and sad scenarios (t= 9.743; p < 0.001). Finally, pH was less acidic when participants observed happy compared to neutral (t= 11.923; p < 0.001) and sad video clips (t= 13.806; p < 0.001), see Fig.6, left panel. Interestingly, the double interaction Time X Type of video clip was also significant (F = 3.250, p = 0.0113, Eta2 < 0.01), suggesting that the time of exposure to the induction differentially influenced the pH of the stomach depending on the type of the observed video clip. Simple slope analysis showed that while pH did not change over time when observing disgusting (t= -1.2691; p = 0.2045) and happy (t= 0.4466; p = 0.6552) clips, it did significantly decrease over time when observing fearful (t= -4.4212; p < 0.001), sad (t= -2.0487; p = 0.0405) and neutral video clips (t= -2.7956; p = 0.0052), see Fig.6, right panel.”
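For readers who prefer an explicit formula, Model 2 as described corresponds to a random-intercept linear mixed model of roughly the following form. This is a sketch: the coding of the video-clip factor (four dummy variables V_c against a reference clip type), whether Time is centred, and the inclusion of the Time X clip interaction (inferred from the reported interaction effect) are assumptions; the authoritative specification is the one in the Supplementary information. Here i indexes datapoints, j participants, T is the datapoint index ("Time"), and σ_f^2 is the variance of the fixed-effect predictions.

```latex
\mathrm{pH}_{ij} = \beta_0 + \sum_{c=1}^{4} \beta_c V_{c,ij} + \beta_T T_{ij}
  + \sum_{c=1}^{4} \gamma_c V_{c,ij} T_{ij} + u_j + \varepsilon_{ij},
\qquad u_j \sim \mathcal{N}(0, \sigma_u^2), \quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma_\varepsilon^2)

R^2_{m} = \frac{\sigma_f^2}{\sigma_f^2 + \sigma_u^2 + \sigma_\varepsilon^2}, \qquad
R^2_{c} = \frac{\sigma_f^2 + \sigma_u^2}{\sigma_f^2 + \sigma_u^2 + \sigma_\varepsilon^2}
```

Under these standard marginal/conditional definitions, the gap between the reported conditional (0.79) and marginal (0.014) R2, about 0.78, is the share of total variance attributed to the by-subject intercepts u_j.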

      We believe that the new evidence reported provides support for our claims, and we hope that the reviewer agrees with us. However, as we also mention in the paper, we are aware that replications are needed, and we are already working on this.

    1. Author Response:

      Reviewer #1:

      The largest concern with the manuscript is its use of resting-state recordings in Parkinson's Disease patients on and off levodopa, which the authors interpret as indicative of changes in dopamine levels in the brain but not indicative of altered movement and other neural functions. For example, when patients are off medication, their UPDRS scores are elevated, indicating that they may have spontaneous movements or motor abnormalities that will likely produce changed activations in MEG and LFP during "rest". The authors must address whether it is possible to study a true "resting state" in unmedicated patients with severe PD. At minimum, this concern must be discussed in the manuscript.

      We agree that Parkinson’s disease can lead to unwanted movements such as tremor as well as hyperkinesias. This would of course be a deviation from a resting state in healthy subjects. However, such movements are part of the disease and occur involuntarily. The main tremor in Parkinson’s disease is a rest tremor and, as the name suggests, it occurs while the patient is not doing anything. Therefore, such movements can arguably be considered part of the resting state of Parkinson’s disease. Resting-state activity with and without medication is therefore still representative of changes in brain activity in Parkinson’s patients and indicative of alterations due to medication.

      To further investigate the effect of movement in our patients, we subdivided the UPDRS part 3 score into tremor and non-tremor subscores. For the tremor subscore we took the mean of items 15 and 17 of the UPDRS, whereas for the non-tremor subscore items 1, 2, 3, 9, 10, 12, 13, and 14 were averaged. Following Spiegel et al. (2007), we classified patients as akinetic-rigid (non-tremor score at least twice the tremor score), tremor-dominant (tremor score at least twice the non-tremor score), or mixed type (all remaining cases). Of the 17 patients, 1 was tremor-dominant and 1 was classified as mixed type (his/her non-tremor score was greater than the tremor score, but less than twice as large). None of our patients exhibited hyperkinesias during the recording. To exclude the possibility that our results were driven by tremor-related movement, we re-ran the HMM without the tremor-dominant and the mixed-type patient (see Figure R1 of the response letter).
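      The subtype rule described above can be sketched in Python as follows. The item numbers are the UPDRS part 3 items listed in the response; the function names are hypothetical, and the tie-breaking order (a patient with both subscores equal to zero falls into the first branch) is an assumption, since the response does not state how such a case was handled.

```python
from statistics import mean

# UPDRS part 3 item numbers, as listed in the response.
TREMOR_ITEMS = (15, 17)
NON_TREMOR_ITEMS = (1, 2, 3, 9, 10, 12, 13, 14)


def subscores(items):
    """Return (tremor, non_tremor) subscores as the mean of the listed items.

    items: dict mapping UPDRS part 3 item number -> score.
    """
    tremor = mean(items[i] for i in TREMOR_ITEMS)
    non_tremor = mean(items[i] for i in NON_TREMOR_ITEMS)
    return tremor, non_tremor


def classify_subtype(items):
    """Classify a patient with the 'at least twice the other score' rule."""
    tremor, non_tremor = subscores(items)
    if non_tremor >= 2 * tremor:
        return "akinetic-rigid"
    if tremor >= 2 * non_tremor:
        return "tremor-dominant"
    return "mixed"


# Hypothetical patient: tremor items scored 3, all other items scored 1.
scores = {i: 1 for i in range(1, 18)}
scores[15] = scores[17] = 3
print(classify_subtype(scores))  # tremor-dominant
```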

      ON medication results for all HMM states remained the same. OFF medication results for the Ctx-Ctx and STN-STN states remained the same as well. The OFF medication Ctx-STN state was split into two states: sensorimotor-STN connectivity was captured in one state, and all other types of Ctx-STN connections were captured in another (see Figure R1 of the response letter). Importantly, both with and without the two subjects, a stable covariance matrix entailing sensorimotor-STN connectivity was determined, which is the main finding for the OFF medication Ctx-STN state. The biological conclusions therefore stand across both solutions.

      We now discuss this issue in the limitations section (page 20):

      “Both motor impairment and motor improvement can cause movement during the resting state in PD. While such movement is a deviation from a resting state in healthy subjects, it is part of the disease and occurs involuntarily. Therefore, such movements can arguably be considered part of the resting state of Parkinson’s disease. None of the patients in our cohort experienced hyperkinesia during the recording. All patients except two were of the akinetic-rigid subtype. We verified that tremor movement is not driving our results: recalculating the HMM states without these two subjects slightly changed some particular aspects of the HMM solution but did not materially affect the conclusions.”

      Figure R1: States obtained after removing one tremor dominant and one mixed type patient from analysis. Panel C shows the split OFF medication cortico-STN state. Most of the cortico-STN connectivity is captured by the state shown in the top row (Figure 1 C OFF). Only the motor-STN connectivity in the alpha and beta band (along with a medial frontal-STN connection in the alpha band) is captured separately by the states labeled “OFF Split” (Figure 1 C OFF SPLIT).

      This reviewer was unclear on why increased "communication" in the medial OFC in delta and theta was interpreted as a pathological state indicating deteriorated frontal executive function. Given that the authors provide no evidence of poor executive function in the patients studied, the authors must at least provide evidence from other studies linking this feature with impaired executive function.

      If we understand the comment correctly, it refers to the statement in the abstract: “Dopaminergic medication led to communication within the medial and orbitofrontal cortex in the delta/theta frequency range. This is in line with deteriorated frontal executive functioning as a side effect of dopamine treatment in Parkinson’s disease.”

      This statement is based on the dopamine overdose hypothesis reported in the Parkinson’s disease (PD) literature (Cools 2001; Kelly et al. 2009; MacDonald and Monchi 2011; Vaillancourt et al. 2013). We have elaborated upon the dopamine overdose hypothesis in the discussion on page 16. In short, dopaminergic neurons are primarily lost from the substantia nigra in PD, which causes greater dopamine depletion in the dorsal striatal circuitry than in the ventral striatal circuits (Kelly et al. 2009; MacDonald and Monchi 2011). Dopaminergic medication dosed to treat the PD motor symptoms therefore leads to increased dopamine levels in the relatively spared ventral striatal circuits, affecting frontal cortical activity, which can potentially explain the cognitive deficits observed in PD (Shohamy et al. 2005; George et al. 2013). We adjusted the abstract to read:

      “Dopaminergic medication led to coherence within the medial and orbitofrontal cortex in the delta/theta frequency range. This is in line with known side effects of dopamine treatment such as deteriorated executive functions in Parkinson’s disease.”

      In this article, authors repeatedly state their method allows them to delineate between pathological and physiological connectivity, but they don't explain how dynamical systems and discrete-state stochasticity support that goal.

      To recapitulate, the HMM divides a continuous time series into discrete states. Each state is a time-delay embedded covariance matrix reflecting the underlying connectivity between brain regions as well as the specific temporal dynamics in the data when that state is active. See Packard et al. (1980) for details about how a time-delay embedding can reconstruct the state space of a dynamical system.
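      A minimal sketch of how an HMM assigns each time point to a discrete state is shown below, assuming a one-dimensional signal, Gaussian emissions, hand-set parameters and Viterbi decoding. The model in the manuscript instead uses multivariate, time-delay embedded covariance states whose parameters are learned from the data, and its inference may be probabilistic rather than a single decoded path; the function names and numbers here are illustrative only.

```python
import math


def gaussian_logpdf(x, mu, sigma):
    """Log density of a univariate normal distribution."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def viterbi(obs, means, sigmas, trans, init):
    """Most likely state sequence for a Gaussian-emission HMM (log space).

    trans[r][s] is the probability of moving from state r to state s;
    init[s] is the probability of starting in state s.
    """
    k = len(means)
    log_trans = [[math.log(p) for p in row] for row in trans]
    delta = [math.log(init[s]) + gaussian_logpdf(obs[0], means[s], sigmas[s])
             for s in range(k)]
    back = []  # back[t - 1][s]: best predecessor of state s at time t
    for x in obs[1:]:
        pointers, new_delta = [], []
        for s in range(k):
            scores = [delta[r] + log_trans[r][s] for r in range(k)]
            best = max(range(k), key=scores.__getitem__)
            pointers.append(best)
            new_delta.append(scores[best] + gaussian_logpdf(x, means[s], sigmas[s]))
        back.append(pointers)
        delta = new_delta
    # Trace the best path backwards from the most likely final state.
    state = max(range(k), key=delta.__getitem__)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical signal that switches from a low to a high regime.
signal = [0.1, -0.2, 0.0, 5.1, 4.9, 5.0]
print(viterbi(signal, [0.0, 5.0], [1.0, 1.0],
              [[0.9, 0.1], [0.1, 0.9]], [0.5, 0.5]))  # [0, 0, 0, 1, 1, 1]
```

      The "sticky" transition matrix penalises switching, so short fluctuations do not trigger a state change unless the emission evidence outweighs the switching cost.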

      Please note that the HMM was used as a data-driven, descriptive approach without explicitly assuming any a priori relationship with pathological or physiological states. The relation between biology and the HMM states thus emerged purely from the data; i.e., it is empirical. What we claim in this work is simply that the features captured by the HMM bear some relation to physiology, even though the estimation of the HMM was completely unsupervised (i.e., blind to the studied conditions). We have also added this point to the limitations of the study (page 19), and added the following to the introduction (page 4) to guide the reader more intuitively:

      “To allow the system to dynamically evolve, we use time-delay embedding. Theoretically, delay embedding can reveal the state space of the underlying dynamical system (Packard et al., 1980). Thus, by delay-embedding PD time series OFF and ON medication, we uncover the differential effects of a neurotransmitter such as dopamine on the underlying whole-brain connectivity.”
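      The delay embedding described in the quoted passage can be sketched as follows. For each channel, copies shifted by a set of lags are stacked as extra dimensions; the covariance of the embedded rows then contains lagged auto- and cross-covariances, which is how an embedded covariance matrix can carry spectral as well as connectivity information. The lag set and function names are assumptions for illustration, not the authors' parameters.

```python
def delay_embed(channels, lags):
    """Time-delay embed multichannel data.

    channels: list of channels, each a list of samples of equal length.
    lags:     iterable of integer lags, e.g. range(-L, L + 1).
    Returns one embedded row per (channel, lag) pair, trimmed so that every
    shifted index stays inside the recording.
    """
    lags = list(lags)  # allow generators; the lags are iterated twice
    n = len(channels[0])
    max_lag = max(abs(lag) for lag in lags)
    valid = range(max_lag, n - max_lag)
    return [[ch[t + lag] for t in valid] for ch in channels for lag in lags]


def covariance(rows):
    """Sample covariance matrix (n - 1 denominator) between rows."""
    n = len(rows[0])
    means = [sum(r) / n for r in rows]
    return [
        [sum((a - ma) * (b - mb) for a, b in zip(ra, rb)) / (n - 1)
         for rb, mb in zip(rows, means)]
        for ra, ma in zip(rows, means)
    ]


# Hypothetical two-channel recording embedded with lags -2..2.
data = [[0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0],
        [1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 0.0]]
embedded = delay_embed(data, range(-2, 3))
print(len(embedded), len(embedded[0]))  # 10 4
```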

      Reviewer #2:

      Sharma et al. investigated the effect of dopaminergic medication on brain networks in patients with Parkinson's disease combining local field potential recordings from the subthalamic nucleus and magnetoencephalography during rest. They aim to characterize both physiological and pathological spectral connectivity.

      They identified three networks, or brain states, that are differentially affected by medication. Under medication, the first state (termed hyperdopaminergic state) is characterized by increased connectivity of frontal areas, supposedly responsible for deteriorated frontal executive function as a side effect of medical treatment. In the second state (communication state), dopaminergic treatment largely disrupts cortico-STN connectivity, leaving only selected pathways communicating. This is in line with current models that propose that alleviation of motor symptoms relates to the disruption of pathological pathways. The local state, characterized by STN-STN oscillatory activities, is less affected by dopaminergic treatment.

      The authors utilize sophisticated methods with the potential to uncover the dynamics of activities within different brain networks, which opens the avenue to investigate how the brain switches between different states, and how these states are characterized in terms of spectral, local, and temporal properties. The conclusions of this paper are mostly well supported by data, but some aspects, mainly concerning the presentation of the results, remain:

      We would like to thank the reviewer for his succinct and clear understanding of our work.

      1) The presentation of the results is suboptimal and needs improvement to increase readers' comprehension. At some points this section seems rather unstructured, some results are presented multiple times, and some passages already include points rather suitable for the discussion, which adds too much information for the results section.

      We have removed repetitions in the results section and removed the rather lengthy introductory parts of each subsection. Moreover, we have now moved all passages that already interpreted our findings to the discussion.

      2) It is intriguing that the hyperdopaminergic state is not only identified under medication but also in the off-state. This is especially intriguing given the results on the temporal properties of states, which show that the time spent in the hyperdopaminergic state is unaffected by medication. When such a state can be identified even in the absence of levodopa, is it really optimal to call it "hyperdopaminergic"? Do the results not rather suggest that the identified network is active both off and on medication, while during the latter state its activities are modulated in a way that could relate to side effects?

      The reviewer’s interpretation of the results pertaining to the hyper-dopaminergic state is correct. The states were named post hoc, as explained in the results section. The hyper-dopaminergic state was so named because it showed the overdosing effects of dopamine. Of course, these effects are only visible on medication. Off medication, however, this state also exists, without exhibiting the effects of excess dopamine. To avoid confusion or misinterpretation of the findings, and also following the relevant comment by reviewer 1, we renamed all states to be more descriptive:

      Hyperdopaminergic > Cortico-cortical state

      Communication > Cortico-STN state

      Local > STN-STN state.

      3) Some conclusions need to be improved/more elaborated. For example, bilateral STN-STN coherence did not change between the OFF and ON medication states. Yet it is argued that a) "Since synchrony limits information transfer (Cruz et al. 2009; Cagnan, Duff, and Brown 2015; Holt et al. 2019) , local oscillations are a potential mechanism to prevent excessive communication with the cortex" (line 436) and b) "Another possibility is that a loss of cortical afferents causes local basal ganglia oscillations to become more pronounced" (line 438). Can these conclusions really be drawn if the local oscillations did not change in the first place?

      We apologize for the unclear description. Our conclusion was based on the following results:

      a) We state that STN-STN connectivity as measured by the magnitude of STN-STN coherence does not change OFF vs ON medication in the Cortico-STN state. This result is obtained using inter-medication analysis.

      b) But ON medication, STN-STN coherence in the Cortico-STN state was significantly different from mean coherence within the ON condition. These results are obtained using intra-medication analysis.

      Based on this, we conclude that in the Cortico-STN state, although the magnitude of STN-STN coherence was unchanged OFF vs ON medication, STN-STN coherence ON medication was significantly different from the mean coherence in the ON medication condition. The emergence of synchronous STN-STN activity may limit information exchange between STN and cortex ON medication.
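      The difference between the two analyses can be illustrated with made-up numbers. The sketch assumes that the "mean coherence within the ON condition" is the occupancy-weighted average of the state-specific coherences; the response does not specify this, and the statistical testing used in the manuscript is omitted. All names and values are hypothetical.

```python
def occupancy_weighted_mean(coh_by_state, occupancy):
    """Condition-wide mean coherence, weighting each state by its occupancy."""
    total = sum(occupancy[s] for s in coh_by_state)
    return sum(coh_by_state[s] * occupancy[s] for s in coh_by_state) / total


def intra_med_contrast(coh_by_state, occupancy, state):
    """State coherence minus the mean coherence within the same condition."""
    return coh_by_state[state] - occupancy_weighted_mean(coh_by_state, occupancy)


def inter_med_contrast(coh_on, coh_off, state):
    """State coherence ON medication minus the same state OFF medication."""
    return coh_on[state] - coh_off[state]


# Hypothetical STN-STN coherence values in each state (not the study's data).
stn_stn_on = {"Cortico-cortical": 0.20, "Cortico-STN": 0.40, "STN-STN": 0.30}
stn_stn_off = {"Cortico-cortical": 0.25, "Cortico-STN": 0.40, "STN-STN": 0.35}
occupancy_on = {"Cortico-cortical": 0.25, "Cortico-STN": 0.25, "STN-STN": 0.50}

print(inter_med_contrast(stn_stn_on, stn_stn_off, "Cortico-STN"))  # 0.0
print(intra_med_contrast(stn_stn_on, occupancy_on, "Cortico-STN"))
```

      Here the inter-medication contrast is zero while the intra-medication contrast is positive (about 0.1), which is the same pattern as in points a) and b).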

      An alternative explanation for these findings might be a mechanism preventing connectivity between cortex and the STN ON medication. This missing interaction between STN and cortex might cause STN-STN oscillations to increase compared to the mean coherence within the ON state. Unfortunately, we cannot test such causal influences with our analysis.

      We have added the following discussion to the manuscript on page 17 in order to improve the exposition:

      “Bilateral STN–STN coherence in the alpha and beta band did not change in the cortico-STN state ON versus OFF medication (InterMed analysis). However, STN-STN coherence was significantly higher than the mean level ON medication (IntraMed analysis). Since synchrony limits information transfer (Cruz et al. 2009; Cagnan, Duff, and Brown 2015; Holt et al. 2019), the high coherence within the STN ON medication could prevent communication with the cortex. A different explanation would be that a loss of cortical afferents leads to increased local STN coherence. The causal nature of the cortico-basal ganglia interaction is an endeavour for future research.”

      Reviewer #3:

      In PD, pathological neuronal activity along the cortico-basal ganglia network notably consists in the emergence of abnormal synchronized oscillatory activity. Nevertheless, synchronous oscillatory activity is not necessarily pathological and can also serve crucial cognitive functions in the brain. Moreover, the effects of dopaminergic medication on oscillatory network connectivity in PD are still poorly understood. To clarify these issues, Sharma and colleagues simultaneously recorded MEG and STN LFP signals in PD patients and characterized the effect of dopamine (ON and OFF dopaminergic medication) on oscillatory whole-brain networks (including the STN) in a time-resolved manner. Here, they identified three physiologically interpretable spectral connectivity patterns and found that cortico-cortical, cortico-STN, and STN-STN networks were differentially modulated by dopaminergic medication.

      Strengths:

      1) Both the methodological and experimental approaches used are thoughtful and rigorous.

      a) The use of an innovative data-driven machine learning approach (by employing a hidden Markov model), rather than hand-crafted analyses, to identify physiologically interpretable spectral connectivity patterns (i.e., distinct networks/states) is undeniably an added value. In doing so, the results are not biased by the human expertise and subjectivity, which make them even more solid.

      b) So far, the recurrent oscillatory patterns of transient network connectivity within and between the cortex and the STN reported in PD have been assessed only for specific cortico-STN spectral connectivity. Conversely, whole-brain MEG studies in PD patients did not account for cortico-STN and STN-STN connectivity. Here, the authors studied, for the first time, the whole-brain connectivity including the STN (whole brain-STN approach) and therefore provide new evidence of the brain connectivity reported in PD, as well as new information regarding the effect of dopaminergic medication on the recurrent oscillatory patterns of transient network connectivity within and between the cortex and the STN reported in PD.

      2) Studying the temporal properties of the recurrent oscillatory patterns of transient network connectivity both ON and OFF medication is extremely important and provides interesting and crucial information in order to delineate pathological versus physiologically relevant spectral brain connectivity in PD.

      We would like to thank the reviewer for their valuable feedback and correct interpretation of our manuscript.

      Weaknesses:

      1) In this study, the authors implied that the ON dopaminergic medication state corresponds to a physiological state. However, as correctly mentioned in the limitations of the study, they did not have (for obvious reasons) a control/healthy group. Moreover, no one can exclude the emergence of compensatory and/or plasticity mechanisms in the brain of the PD patients related to the duration of the disease and/or the history of the chronic dopamine-replacement therapy (DRT). Duration of the disease and DRT history should therefore be considered when characterizing the recurrent oscillatory patterns of transient network connectivity within and between the cortex and the STN reported in PD, as well as when examining the effect of the dopaminergic medication on the functioning of these specific networks.

      We would like to thank the reviewer for pointing this out. Using regression, we tested for a relationship between disease duration (year of measurement minus year of onset) and the temporal properties of the HMM states, and found no relationship between any of the temporal properties and disease duration. Similarly, we found no relationship between the temporal properties and each subject’s levodopa equivalent dosage. We now discuss this point in the manuscript (page 20):

      “A further potential influencing factor might be the disease duration and the amount of dopamine patients are receiving. Neither factor was significantly related to the temporal properties of the states.”
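      The duration check can be sketched as a simple least-squares regression of one temporal property (for example, the fractional occupancy of a state) on disease duration. Treating the temporal property as the dependent variable, the function names and the data below are assumptions for illustration; significance testing is left out.

```python
def disease_duration(year_of_measurement, year_of_onset):
    """Disease duration in years, as defined in the response."""
    return year_of_measurement - year_of_onset


def ols_fit(x, y):
    """Simple least-squares fit y = intercept + slope * x, plus Pearson r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


# Hypothetical patients: (year of measurement, year of onset) and the
# fractional occupancy of one HMM state.
years = [(2019, 2008), (2019, 2012), (2020, 2009), (2020, 2014)]
occupancy = [0.31, 0.29, 0.33, 0.30]
durations = [disease_duration(m, o) for m, o in years]
print(ols_fit(durations, occupancy))
```

      The same function could be reused with levodopa equivalent dosage in place of disease duration.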

      2) Here, the authors recorded LFPs from the STN. LFP represents sub-threshold (e.g., synaptic input) activity at best (Buzsaki et al., 2012; Logothetis, 2003). Recent studies demonstrated that mono-polar, but also bi-polar, BG LFPs are largely contaminated by volume conductance of cortical electroencephalogram (EEG) activity even when re-referenced (Lalla et al., 2017; Marmor et al., 2017). Therefore, it is likely that STN LFPs do not accurately reflect local cellular activity. In this study, the authors examined and measured coherence between cortical areas and STN. However, they cannot guarantee that STN signals were not contaminated by volume conducted signals from the cortex.

      We appreciate this concern and thank the reviewer for bringing it up. Marmor et al. (2017) investigated this in humans and is therefore most closely related to our research. They found that re-referenced STN recordings are not contaminated by cortical signals. Furthermore, the data in Lalla et al. (2017) are based on recordings in rats, making a direct transfer to human STN recordings problematic due to the different brain sizes. Since we re-referenced our LFP signals as recommended in the Marmor paper, we think that contamination by cortical signals is relatively minor; see Litvak et al. (2011), Hirschmann et al. (2013), and Neumann et al. (2016) for additional references supporting this. That being said, we now discuss this potential issue in the paper on page 20:

      “Lastly, we recorded LFPs from within the STN, an established recording procedure during the implantation of DBS electrodes in various neurological and psychiatric diseases. Although results on beta and tremor activity within the STN of Parkinson patients have been reproduced by different groups (Reck et al. 2010, Litvak et al. 2011, Florin et al. 2013, Hirschmann et al. 2013, Neumann et al. 2016), it is still not fully clear whether these LFP signals are contaminated by volume-conducted cortical activity. However, while volume conduction seems to be a larger problem in rodents, even after re-referencing the LFP signal (Lalla et al. 2017), the same was not found in humans (Marmor et al. 2017).”
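      Why re-referencing helps can be shown with an idealised example: if a volume-conducted component reaches neighbouring contacts with identical amplitude, subtracting one contact from another cancels it and leaves only the difference of the local signals. This is a simplification, since in practice a far-field signal may differ slightly between contacts and is then attenuated rather than removed, and the pairwise montage over three directional contacts is a hypothetical scheme, not necessarily the one used in the study.

```python
from itertools import combinations


def bipolar_montage(contacts):
    """Pairwise bipolar derivations between recording contacts.

    contacts: dict mapping a contact name to its list of samples.
    Returns a dict mapping "a-b" to the sample-wise difference a - b.
    """
    names = sorted(contacts)
    return {
        f"{a}-{b}": [x - y for x, y in zip(contacts[a], contacts[b])]
        for a, b in combinations(names, 2)
    }


# Hypothetical data: a far-field component identical at all contacts
# plus a different local signal at each contact.
common = [10, -3, 7]
local = {"A": [1, 2, 3], "B": [0, 1, 0], "C": [2, 2, 2]}
recorded = {k: [l + c for l, c in zip(v, common)] for k, v in local.items()}

print(bipolar_montage(recorded)["A-B"])  # [1, 1, 3]
```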

      3) The methods and data processing are rigorous but also very sophisticated, which makes the interpretation of the results in terms of oscillatory activity and neural synchronization difficult.

      To aid intuition on how to interpret the result in light of the methods used, one can compare the analysis pipeline to a windowing approach. In a more standard approach, windows of different time length can be defined for different epochs within the time series and for each window coherence and connectivity can be determined. The difference in our approach is that we used an unsupervised learning algorithm to select windows of varying length based on recurring patterns of whole brain network activity. Within those defined windows we then determine the oscillatory properties via coherence and power – which is the same as one would do in a classical analysis. We have added an explanation of the concept of “oscillatory activity” within our framework to the introduction (page 2 footnote):

      “For the purpose of our paper, we refer to oscillatory activity or oscillations as recurrent but transient frequency-specific patterns of network activity, even though the underlying patterns can be composed of either sustained rhythmic activity, neural bursting, or both (Quinn et al. 2019).”

      Moreover, we provide a more intuitive explanation of the analysis within the first section of the results (page 4):

      “Using an HMM, we identified recurrent patterns of transient network connectivity between the cortex and the STN, which we henceforth refer to as an ‘HMM state’. In comparison to classic sliding-window analysis, an HMM solution can be thought of as a data-driven estimation of time windows of variable length (within which a particular HMM state was active): once we know the time windows when a particular state is active, we compute coherence between different pairs of regions for each of these recurrent states.”
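      Once each sample has been assigned to a state, the variable-length windows and the temporal properties of each state follow directly from the state sequence. The sketch below assumes a hard state assignment (for example, a decoded state path) and a sampling rate `fs` in Hz; both, and the function names, are assumptions. Coherence and power would then be computed from the samples inside each state's windows.

```python
from collections import defaultdict


def state_windows(path):
    """Return {state: [(start, stop), ...]} using half-open sample intervals."""
    windows = defaultdict(list)
    start = 0
    for t in range(1, len(path) + 1):
        # Close the current window at the end of the path or on a state change.
        if t == len(path) or path[t] != path[start]:
            windows[path[start]].append((start, t))
            start = t
    return dict(windows)


def temporal_properties(path, fs=1.0):
    """Fractional occupancy, mean lifetime (seconds) and number of visits."""
    props = {}
    for state, wins in state_windows(path).items():
        lengths = [stop - start for start, stop in wins]
        props[state] = {
            "fractional_occupancy": sum(lengths) / len(path),
            "mean_lifetime_s": sum(lengths) / len(lengths) / fs,
            "n_visits": len(wins),
        }
    return props


# Hypothetical decoded state path.
path = [0, 0, 1, 1, 1, 0, 2, 2]
print(state_windows(path))
print(temporal_properties(path))
```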

      4) Previous studies have shown that abnormal oscillations within the STN of PD patients are limited to its dorsolateral/motor region, thus dividing the STN into a dorsolateral oscillatory/motor region and ventromedial non-oscillatory/non-motor region (Kuhn et al. 2005; Moran et al. 2008; Zaidel et al. 2009, 2010; Seifreid et al. 2012; Lourens et al. 2013, Deffains et al., 2014). However, the authors do not provide clear information about the location of the LFP recordings within the STN.

      We selected the electrode contacts based on intraoperative microelectrode recordings (for details, see page 23). The first directional recording height after entry into the STN was selected to obtain the three directional LFP recordings from the respective hemisphere. This practice has been shown to improve target location (Kochanski et al., 2019; Krauss et al., 2021). The common target area for DBS surgery is the dorsolateral STN. To confirm that the electrodes were actually located within this part of the STN, we have now reconstructed the DBS electrode locations with Lead-DBS (Horn et al. 2019). All electrodes except one were located within the dorsolateral STN (see figure 7 of the manuscript). To exclude the possibility that our results were driven by this outlier, we reanalysed our data without this patient. No change in the overall connectivity pattern was observed (see figure R3 of the response letter).

      Figure R2: Lead DBS reconstruction of the location of electrodes in the STN for different subjects. The red electrodes have not been placed properly in the STN. The contacts marked in red represent the directional contacts from which the data was used for analysis.

      Figure R3: HMM states obtained after running the analysis without the subject with the electrode outside the STN.

      References:

      Buzsáki G, Anastassiou CA, Koch C. The origin of extracellular fields and currents-EEG, ECoG, LFP and spikes. Nat Rev Neurosci 2012; 13: 407–20.

      Cagnan H, Duff EP, Brown P. The relative phases of basal ganglia activities dynamically shape effective connectivity in Parkinson’s disease. Brain 2015; 138: 1667–78.

      Cools R. Enhanced or impaired cognitive function in Parkinson’s disease as a function of dopaminergic medication and task demands. Cereb Cortex 2001; 11: 1136–43.

      Cruz AV, Mallet N, Magill PJ, Brown P, Averbeck BB. Effects of dopamine depletion on network entropy in the external globus pallidus. J Neurophysiol 2009; 102: 1092–102.

      Florin E, Erasmi R, Reck C, Maarouf M, Schnitzler A, Fink GR, et al. Does increased gamma activity in patients suffering from Parkinson’s disease counteract the movement inhibiting beta activity? Neuroscience 2013; 237: 42–50.

      George JS, Strunk J, Mak-Mccully R, Houser M, Poizner H, Aron AR. Dopaminergic therapy in Parkinson’s disease decreases cortical beta band coherence in the resting state and increases cortical beta band power during executive control. NeuroImage Clin 2013; 3: 261–70.

      Hirschmann J, Özkurt TE, Butz M, Homburger M, Elben S, Hartmann CJ, et al. Differential modulation of STN-cortical and cortico-muscular coherence by movement and levodopa in Parkinson’s disease. Neuroimage 2013; 68: 203–13.

      Holt AB, Kormann E, Gulberti A, Pötter-Nerger M, McNamara CG, Cagnan H, et al. Phase-dependent suppression of beta oscillations in Parkinson’s disease patients. J Neurosci 2019; 39: 1119–34.

      Horn A, Li N, Dembek TA, Kappel A, Boulay C, Ewert S, et al. Lead-DBS v2: Towards a comprehensive pipeline for deep brain stimulation imaging. Neuroimage 2019; 184: 293–316.

      Kelly C, De Zubicaray G, Di Martino A, Copland DA, Reiss PT, Klein DF, et al. L-dopa modulates functional connectivity in striatal cognitive and motor networks: A double-blind placebo-controlled study. J Neurosci 2009; 29: 7364–78.

      Kochanski RB, Bus S, Brahimaj B, Borghei A, Kraimer KL, Keppetipola KM, et al. The impact of microelectrode recording on lead location in deep brain stimulation for the treatment of movement disorders. World Neurosurg 2019; 132: e487–95.

      Krauss P, Oertel MF, Baumann-Vogel H, Imbach L, Baumann CR, Sarnthein J, et al. Intraoperative neurophysiologic assessment in deep brain stimulation surgery and its impact on lead placement. J Neurol Surgery, Part A Cent Eur Neurosurg 2021; 82: 18–26.

      Lalla L, Rueda Orozco PE, Jurado-Parras MT, Brovelli A, Robbe D. Local or not local: Investigating the nature of striatal theta oscillations in behaving rats. eNeuro 2017; 4: 128–45.

      Litvak V, Jha A, Eusebio A, Oostenveld R, Foltynie T, Limousin P, et al. Resting oscillatory cortico-subthalamic connectivity in patients with Parkinson’s disease. Brain 2011; 134: 359–74.

      MacDonald PA, MacDonald AA, Seergobin KN, Tamjeedi R, Ganjavi H, Provost JS, et al. The effect of dopamine therapy on ventral and dorsal striatum-mediated cognition in Parkinson’s disease: Support from functional MRI. Brain 2011; 134: 1447–63.

      MacDonald PA, Monchi O. Differential effects of dopaminergic therapies on dorsal and ventral striatum in Parkinson’s disease: Implications for cognitive function. Parkinsons Dis 2011; 2011: 1–18.

      Marmor O, Valsky D, Joshua M, Bick AS, Arkadir D, Tamir I, et al. Local vs. volume conductance activity of field potentials in the human subthalamic nucleus. J Neurophysiol 2017; 117: 2140–51.

      Neumann WJ, Degen K, Schneider GH, Brücke C, Huebl J, Brown P, et al. Subthalamic synchronized oscillatory activity correlates with motor impairment in patients with Parkinson’s disease. Mov Disord 2016; 31: 1748–51.

      Packard NH, Crutchfield JP, Farmer JD, Shaw RS. Geometry from a time series. Phys Rev Lett 1980; 45: 712–6.

      Quinn AJ, van Ede F, Brookes MJ, Heideman SG, Nowak M, Seedat ZA, et al. Unpacking Transient Event Dynamics in Electrophysiological Power Spectra. Brain Topogr 2019; 32: 1020–34.

      Reck C, Himmel M, Florin E, Maarouf M, Sturm V, Wojtecki L, et al. Coherence analysis of local field potentials in the subthalamic nucleus: Differences in parkinsonian rest and postural tremor. Eur J Neurosci 2010; 32: 1202–14.

      Shohamy D, Myers CE, Grossman S, Sage J, Gluck MA. The role of dopamine in cognitive sequence learning: Evidence from Parkinson’s disease. Behav Brain Res 2005; 156: 191–9.

      Spiegel J, Hellwig D, Samnick S, Jost W, Möllers MO, Fassbender K, et al. Striatal FP-CIT uptake differs in the subtypes of early Parkinson’s disease. J Neural Transm 2007; 114: 331–5.

      Vaillancourt DE, Schonfeld D, Kwak Y, Bohnen NI, Seidler R. Dopamine overdose hypothesis: Evidence and clinical implications. Mov Disord 2013; 28: 1920–9.

    1. Author Response

      Reviewer #1 (Public Review):

      This study provided evidence to interpret and understand the aging and developmental processes in children. The main strength of the study is that it measures a set of biological age measures and a set of developmental measures, thus providing multi-faceted evidence to explain the associations between aging and development in children. The main weakness of this study is that it does not explain well how the aging hypotheses of "a buildup of biological capital model" and "wear and tear" are measured and tested. Why would the observed associations between biological age measures and developmental measures support the aforementioned aging theories?

      Thank you. On reflection, we agree that how to test the aging hypotheses of "a buildup of biological capital model" and "wear and tear" is not well explained in the manuscript. We have addressed this issue in the point-by-point responses below:

      1) Abstract - conclusion: The aging hypothesis of "a buildup of biological capital model" and "wear and tear" were mentioned in the conclusion without an explanation of these theories in the previous section. Readers who are not experts in the field may not understand the logic.

      We have replaced these phrases in the abstract with the following interpretation, which we hope will be more readily understood:

      “Patterns of associations suggested that accelerated immunometabolic age may be beneficial for some aspects of child development while accelerated DNA methylation age and telomere attrition may reflect early detrimental aspects of biological ageing, apparent even in children.”

      2) Result - Biological age marker performance: the correlation between transcriptome age and chronological age is very strong (r =0.94). I am afraid that very little age-independent information could be captured by the transcriptome age. Is it possible to down-regulate the age dependency of the transcriptome age in the training process?

      Thank you for this important comment. We agree that the high accuracy of this clock may in fact reduce its relevance as a biological age marker, and note that this is a general concern in the field. We have explored the possibility of using a less accurate transcriptome age model as follows: instead of elastic net modelling, we tested using the lasso penalisation only, which results in more parsimonious (sparse) models, as less important features are dropped as the strength of the lambda parameter increases. Plotting the correlation in the test set against the number of features in the model as lambda is sequentially increased (Author response image 1, blue line) shows that after the inclusion of around 200 features, the gain in accuracy becomes less steep.

      Author response image 1.

      We then compared the sensitivity to developmental measures of a model optimised for sparsity at the expense of some prediction accuracy (selected by visual inspection of the plot above; blue line, r in test set = 0.87, 187 features) with that of the most accurate model, as presently included in the manuscript:

      Author response image 2.

      We find that, across all outcomes tested, the less accurate model, based on only the most important features, does not provide an improvement in sensitivity to developmental outcomes compared to the currently used model.

      We therefore prefer to keep the more accurate model in this study, especially as it is consistent with the methodology used for the Horvath and immunometabolic age models and in the field generally. Otherwise, it is not obvious how the biological clock should be trained (particularly for children, for whom mortality data are not available) without altering the whole approach of the study. We have acknowledged and discussed this issue on page 15.

      3) The study population comes from several cohorts, which might influence the results. How were cohort effects controlled for in the analyses?

      The possible influence of cohort is a limitation of the study, which we discuss on page 16. We did not include cohort as a predictor in any of the candidate biological clocks, since this may reduce the detection of some age-related features. Instead, we included cohort as a fixed effect in all analyses with risk factors and developmental outcomes, and we examined the performance of the candidate biological clocks in predicting chronological age within each cohort. As a further check, we have added a sensitivity analysis stratified by cohort (Figure 4-figure supplement 6) for the developmental outcomes that were significant in the main analysis. We find generally consistent effects across cohorts.
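      As a simplified illustration of what including cohort as a fixed effect means, an outcome model could take the form below. The notation is illustrative only: y_i is a developmental outcome for child i, BA_i a biological age measure, C the number of cohorts with cohort 1 as the reference, and any other covariates in the actual models are omitted.

      $$
      y_i = \beta_0 + \beta_1\,\mathrm{BA}_i + \sum_{c=2}^{C}\gamma_c\,\mathbf{1}[\mathrm{cohort}_i=c] + \varepsilon_i
      $$

      The cohort intercepts γ_c absorb differences in mean outcome between cohorts, so β_1 is estimated from variation within cohorts; the cohort-stratified sensitivity analysis instead fits a separate β_1 in each cohort.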

      4) Figure 3 only shows p values. Can the authors also provide the point estimates and 95% confidence intervals, perhaps in a supplemental table?

      This information was originally provided in supplemental table 5 (now Supplementary file 7), combined with the sensitivity analyses. To make this information easier to find, we have made this a stand-alone table (table 3). We now direct readers to this information within the caption of Figure 4 (previously figure 2).

      Reviewer #2 (Public Review):

      The study had an especially relevant aim for aging research and utilized various data types in an especially interesting human population. The multi-omics perspective adds great value to the work. The researchers aimed to evaluate how different indicators of biological age (BA) behave in children during their developmental stage. In the analysis, relationships between indicators of BA, health risk factors, and developmental factors were assessed in cross-sectional data comprising children aged 5-12 years. The manuscript is well-written and easy to follow. The methodology is good. The authors succeeded in reaching the aim for the most part.

      In the study, previously known and unknown biological age indicators were used. Known indicators included telomere length and Horvath's epigenetic age. Unknown (novel) indicators, transcriptomic and immunometabolic clocks, were developed in the present study and showed a strong correlation with calendar age in this population, also in the validation data set. Although the transcriptomic and immunometabolic clocks have the potential to be true indicators of biological age, they still lack scientific evidence of being such indicators in adults. That is, their associations with age-related diseases and mortality are yet to be shown. Thus, the major remark on the study relates to the phrasing: these novel transcriptomic and immunometabolic clocks should be presented as BA indicator candidates awaiting the needed evidence.

      Thank you for this important observation. However, we still find that “biological age indicator” is a useful umbrella term in this manuscript, and there is no obvious alternative. We have therefore added the following sentence on page 8 and highlighted the difference between the markers at key points in the abstract, introduction, results and discussion.

      “We note that since a common definition of markers of biological age is that they should be associated with age-related disease and mortality [69] these new clocks may only currently be considered “candidate” biological age markers. However, we have referred to both the established and candidate markers as biological age markers throughout to simplify presentation.”

    1. Author Response

      Reviewer #1 (Public Review)

      [...] One potential issue is that the high myelination signal is associated with the compartment in V2 (pale stripes) that was not itself functionally defined but was defined by the absence of specific functional activations. No difference was reported between the stripes that were defined functionally. Other explanations for the differential pattern of qMRI signals cannot be excluded based on the analysis presented, e.g., that the ROIs for presumed pale stripes are not evenly distributed (more foveal), or that ROIs with low activations due to some other factor show higher myelin-related signals.

      Indeed, it would have been advantageous to directly functionally delineate pale stripes in V2. Since we were not able to achieve this by fMRI, we needed an indirect method to infer pale stripe contributions in the analysis. We also added a statement in the discussion section to emphasize this more (p. 9, lines 286–288).

      Furthermore, we did not test for different myelination between thin and thick stripes, since we did not have a concrete hypothesis on this. Despite conflicting reports in the literature of stronger myelination in either dark or pale CO stripes, no histological study has reported myelination differences between dark CO thin and thick stripes. Therefore, our primary interest and hypothesis lay in comparing the myelination of thin/thick stripes with that of pale stripes using MRI.

      Thank you very much for this comment about other potential sources of differential qMRI parameter patterns. Indeed, based on the original analysis, we could not exclude that the absence of functional activation around the foveal representation may have biased our analysis. We therefore added a supporting analysis in which we excluded the region around the foveal representation. The excluded cortical region was kept consistent between participants by excluding the same eccentricity range in all maps. We added more details in the results section of the revised manuscript (p. 8, lines 189–202). Figure 5-Supplement 1 and Figure 5-Supplement 3 show the results of this supporting analysis, which reproduced the primary findings from the main analysis, particularly the relatively higher myelination of pale stripes.

      ROI definitions based solely on fMRI activation amplitude have additional limitations. However, we find it unlikely that a small fMRI effect size and low contrast-to-noise ratio (i.e., a stochastic cause of low statistical parameter values or “activation”) affected the results, since Figure 3 shows that we achieved a high degree of reproducibility for each participant.

      We note that the consistent differences we found across MPM and MP2RAGE sessions make it unlikely that some potential artifacts are driving the differences. We also find it unlikely that systematic differences in cerebral blood volume between stripes drove the results. A higher local blood volume would lead to increased BOLD responses but also to a higher R1 value due to deoxyhemoglobin-induced relaxation, which is the opposite of our observation of higher activity but lower R1 values in the thick/thin stripes.

      Further studies using other functional metrics (e.g., VASO, ASL) may help to demonstrate specificity even more clearly, but were beyond the scope of this already extensive study. Although we added extensive further analyses to the revised manuscript, such as controlling for foveal effects or registration performance, we see no way to fully exclude a systematic bias caused by unknown factors.

      Another theoretical and practical issue is the question of "ground truth" for the non-invasive qMRI measures, as the authors, as their starting point, roundly dismiss direct histological tissue studies as conflicting rather than take a critical look at the merit of the conflicting study results and provide a best hypothesis. If so, they need to explain better how they calibrate their non-invasive MR measurements of myelin.

      We agree and have now elaborated further on the limits of specificity of the R1 and R2* signals as cortical myelin markers (p. 2, lines 68–88; p. 6, line 163; p. 8, line 216; p. 9, lines 257–260). However, we still think that it is important for the reader to appreciate the conflicting results of histological studies using myelin staining methods, which form part of the study's background.

      We did not intend to give the impression that MRI provides the missing ground truth to adjudicate histological controversies, only that it provides an alternative and additional view on the open questions. We changed the introduction to better reflect that the study offers a unique view by providing myelination proxies and functional measures in the same individual, which allows direct comparison and investigation of structure-function relationships (see p. 2, lines 68–70; p. 3, lines 93–95) and is not accessible to any other approach. Nevertheless, we note that R1 has been well established as a myelin marker under particular conditions (Kirilina et al., 2020; Mancini et al., 2020; Lazari and Lipp, 2021). It has also been widely used for cortical myelin mapping across a variety of populations, systems and field strengths. We added this statement to the introduction (see p. 2, lines 82–85). We also note that we excluded volunteers with pathologies or neurological disorders from the study and that their mean age was about 28 years. Thus, our conditions were comparable to those of previous (validation) studies.

      Because of the contradictory findings of histological studies, we could not refine our hypothesis beyond the a priori expectation of differences in the myelin-sensitive MRI metrics between thin/thick and pale stripes. To improve the contextual understanding, we added a paragraph in the discussion section covering in more depth how the MRI results relate to known histological findings (see pp. 8–9, lines 216–240).

      While this paper makes an important contribution to the question of the association of specific myelination patterns defining the columnar architecture in V2, it is not entirely clear whether the authors can fully resolve it with the data presented.

      Indeed, we agree that non-invasive aggregate measures such as the R1 metrics offer limited specificity, which precludes a fully conclusive inference about cortical myelination. We have further emphasized this on several occasions in the text (see p. 2, lines 68–88; p. 6, line 163; p. 8, line 216; p. 9, lines 257–260). Since the correspondence between cortical myelin levels and R1 (and other metrics) is an active area of research, we expect that the understanding, sensitivity and specificity of R1 for cortical myelination will further improve. We note that the use of qMRI is a substantial advance over the weighted MRI typically used, which suffers from a lack of specificity due to instrumental idiosyncrasies and varying measurement conditions.

      Reviewer #2 (Public Review)

      [...] Unfortunately, this particular study seems to fall into an unhappy middle ground in terms of the conclusions that can be drawn: the relaxometry measures lack the specificity to be considered "ground truth", while the authors claim that the literature lacks consensus regarding the structures that are being studied. The authors propose that their results resolve whether or not stripes differ in their patterns of myelination, but R1 lacks the specificity to do this. While myelin is a primary driver of relaxation times in cortex, relaxometry cannot be considered to be specific to myelin. It is possible that the small observed changes in R1 are driven by myelin, but they could also reflect other tissue constituents, particularly given the small observed effect sizes. If the literature was clear on the pattern of myelination across stripes, this study could confirm that R1 measurements are sensitive to and consistent with this pattern. But the authors present the work as resolving the question of how myelination differs between stripes, which over-reaches what is possible with this method. As it stands, the measured differences in R1 between functionally-defined cortical regions are interesting, but require further validation (e.g., using invasive myelin staining).

      We agree that we inadvertently overstated the specificity of R1 on several occasions in the text. We have therefore toned down the statements concerning the correspondence between R1 and myelin throughout the manuscript (e.g., see p. 2, lines 68–88; p. 6, line 163; p. 8, line 216; p. 9, lines 257–260).

      We also removed the phrase that gave the impression that MRI can conclusively resolve the conflicting results of histological studies. In the Introduction, we changed the corresponding paragraph to emphasize the alternative view that MRI provides by making it possible to investigate structure-function relationships in the living human brain, which is not possible with invasive myelin staining (see p. 2, lines 68–70; p. 3, lines 93–95).

      We acknowledge that, perhaps aside from electron microscopy, all common markers have shortcomings that limit their specificity. For example, classic histology is not quantitative and has produced conflicting results. There is also the very fundamental issue that the composition of myelin varies significantly across the brain and within brain areas (e.g., its lipid composition; González de San Román et al., 2018). Thus, we regard the different invasive and non-invasive measures as complementary. R1 adds to this arsenal of measures and can be acquired non-invasively. It has been shown to be a reliable myelin marker under certain circumstances. It follows the known myeloarchitectural patterns of the human brain, which we also confirmed for the data of the present study (see Figure 4 and Appendix 2). It is responsive to traumatic changes (Freund et al., 2019), development (Whitaker et al., 2016; Carey et al., 2018; Natu et al., 2019) and plasticity (Lazari et al., 2022). Since we studied healthy volunteers with no known pathologies who were sampled randomly from the population, we believe that the previous results generally apply and suggest sufficient specificity of the R1 marker. Of course, we cannot fully exclude bias due to unknown factors that validation studies have not yet investigated or discovered. Even in that case, however, we expect that the systematic differences between stripe types would remain an important result, most likely pointing to another interesting biological difference between stripes.

      While more research is needed to clarify the precise relationship between R1 and cortical myelin, we think that the meaningful determination of quantitative MR parameters within one cortical area is still of interest to the neuroscientific community.

      Moreover, the results make clear that R1 differences are not sufficiently strong to provide an independent measure of this structure (e.g., for segmentation of stripes). As such, one would still require fMRI to localise stripes, making it unclear what role R1 measures would play in future studies.

      Indeed, the small effect sizes observed in the present study still require functional localization with fMRI. We expected small effect sizes for R1 and R2* given the known small inter-areal and intra-cortical differences in MRI myelin markers. This study was therefore designed as a proof of concept, investigating whether intra-areal R1 differences at the spatial scale of columnar structures can be detected using non-invasive MRI. Our study shows that these differences can be seen, though currently not at the single-voxel level. We anticipate that, with further improvements in sequence development and scanner hardware, high-resolution R1 estimates with sufficient SNR will become achievable, making fMRI redundant for this kind of investigation. Please see our reply to the next comment concerning the impact of using R1 in future studies.

      The Introduction concludes with the statement that "Whereas recent studies have explored cortical myelination ... using non-quantitative, weighted MR images... we showed for the first time myelination differences using MRI on a quantitative basis". As written, this sentence implies that others have demonstrated that simpler non-quantitative imaging can achieve the same aims as qMRI. Simply showing that a given method is able to achieve an aim would not be sufficient: the authors should demonstrate that this constitutes an important advance.

      Thank you for this comment. It goes to the heart of the concerns raised about the specificity and sensitivity of MRI-based myelin metrics. Here we elaborate on the main advantage of using qMRI in our current study and why it is more specific than weighted MR imaging. However, we emphasize that a thorough comparison between qMRI and weighted MRI is highly complex and beyond the scope of our paper; we refer to our recent review on qMRI for further details (Weiskopf et al., 2021). The signal in weighted MRI, even when optimized for the tissue of interest, additionally depends on inhomogeneities in both the RF transmit and receive (bias) fields. Other methods, such as using a ratio image (T1w/T2w), can cancel out the receive field bias entirely (provided the subject does not move between scans) but not the transmit field bias. This hampers the direct analysis and interpretation of signal differences between distant regions of the brain. For high-resolution imaging applications, high magnetic fields such as 7 T are beneficial or even mandatory because of signal-to-noise ratio (SNR) penalties. With increasing field strength, these inhomogeneities also affect small regions such as V2. In these cases, qMRI is advantageous since it provides metrics that are free from these technical biases, significantly improving specificity. As high-field MRI has the potential to non-invasively study the structure and function of the human brain at the spatial scale of cortical layers and columns, we believe that the results of our current study, which demonstrate that qMRI can robustly detect small differences at the level of columnar systems, are relevant for future studies in neuroscience.

      We emphasized these considerations in the revised manuscript (see p. 9, lines 273–285).
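      The point about ratio images can be made concrete with a simplified signal model. This is an illustrative assumption, not a derivation taken from the manuscript: the receive field B1^- is taken to scale each weighted image multiplicatively at position r, while the transmit field B1^+ and the tissue parameters θ (e.g., R1, R2*, proton density) enter through sequence-dependent signal functions f_T1w and f_T2w.

      $$
      S_{\mathrm{T1w}}(\mathbf r) = B_1^-(\mathbf r)\, f_{\mathrm{T1w}}\bigl(\boldsymbol\theta(\mathbf r), B_1^+(\mathbf r)\bigr), \qquad S_{\mathrm{T2w}}(\mathbf r) = B_1^-(\mathbf r)\, f_{\mathrm{T2w}}\bigl(\boldsymbol\theta(\mathbf r), B_1^+(\mathbf r)\bigr)
      $$

      $$
      \frac{S_{\mathrm{T1w}}(\mathbf r)}{S_{\mathrm{T2w}}(\mathbf r)} = \frac{f_{\mathrm{T1w}}\bigl(\boldsymbol\theta(\mathbf r), B_1^+(\mathbf r)\bigr)}{f_{\mathrm{T2w}}\bigl(\boldsymbol\theta(\mathbf r), B_1^+(\mathbf r)\bigr)}
      $$

      Without subject movement, B1^-(r) is identical in both images and cancels in the ratio, but B1^+ remains inside both signal functions and does not generally factor out, because it alters the effective flip angles. Quantitative approaches instead estimate θ itself, with the transmit field mapped or otherwise accounted for, which is why the resulting parameters are not subject to this residual bias.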

      The study includes a very small number of participants (n=4). The advantage of non-invasive in-vivo measurements, despite the fact that they are indirect measures, should be that one can study a reasonable number of subjects. So this low n seems to undermine that point. I rarely suggest additional data collection, but I do feel that a few more subjects would shore up the study's impact.

      The present study followed a deep-phenotyping approach. That is, we focused on acquiring highly reliable datasets in individuals. We did not intend to capture population variance, which is often the goal of group studies, since low-level, basic features such as stripes in V2 are expected to be present in all healthy individuals. We therefore prioritized test-retest fMRI measurements and an alternative MP2RAGE acquisition over a larger number of individuals. This resulted in 6–7 scanning sessions on different days for each individual, 26 long scanning sessions in total. We also note that our sample size is not smaller than that of other studies addressing a similar research question. For example, another fMRI study investigating V2 stripes in humans used the same sample size of n = 4 (Dumoulin et al., 2017).

      The paper overstates what can be concluded in a number of places. For example, the paper suggests in several places that R1 and R2* are highly specific to myelin. On p7, the text reads: "We tested whether different stripe types are differentially myelinated by comparing R1 and R2*..." Relaxation times lack the specificity to definitively attribute these changes purely to myelin. Similarly, on p11: "Our study showed that pale stripes which exhibit lower oxidative metabolic activity according to staining with CO are stronger myelinated than surrounding gray matter in V2." This implies that the study directly links CO staining to myelination. In addition to using non-specific estimates of myelination, the study does not actually measure CO.

      We agree that we did not clearly point out the limitations of R1 myelin mapping. We have therefore toned down the statements about the connection between cortical myelin and R1, and changed the statements mentioned in the reviewer's comment accordingly (see p. 6, line 163; p. 11, lines 353–354). We also included a short paragraph clarifying the terminology used (color-selective thin stripes, disparity-selective thick stripes) (see p. 4, lines 110–114) to avoid inadvertently conflating CO staining with the brain activity we actually measured.

      I'm confused by the analysis in Figure 5. I can appreciate why the authors are keen to present a "tripartite" analysis (thick, thin, and pale stripes), but I find the gray curves confusing. As I understand it, the gray curves as generated include both the stripe of interest (red or blue plots) and the pale stripes. Why not just generate a three-way classification? Generating these plots has in effect already required hard classification of thin and thick stripes, so it is odd to create the gray plots, which mix two types of stripes. Alternatively, could you explicitly model the partial volume for a given cortical location (e.g., under the assumption that the partial volume of thick and thin stripes is indicated by the z-score for the corresponding functional contrast)? One could then estimate the relaxation times as a simple weighted sum of stripe-wise R1 or R2*.

      Figure: weighted average of stripe-wise R1 and R2*. (a) shows the weighted sum of R1 (de-meaned and de-curved) over all V2 voxels. z-scores from the color-selective thin stripe and disparity-selective thick stripe experiments were used as weights in the left and middle groups of bars, respectively. An intermediate threshold of z_max = 1.96 was used, i.e., final weights were defined as weights = z - 1.96. Weights with z < 0 were set to 0. For pale stripes (right group of bars), we used the maximum z-score from the thin and thick stripe measurements. We then set all weights with z ≥ 1.96 to 0 and used the negated values as final weights, i.e., weights = -(max(z) - 1.96). (b) shows the same analysis for R2*. Error bars indicate 1 standard error of the mean.

      (1) Yes, indeed. We agree that modeling the partial volume of each compartment (thin, thick and pale stripes) in each V2 voxel would be the most elegant approach. However, we note that z-scores from the thin and thick stripe experiments may not reflect the voxel-wise partial volume, since they are a purely statistical measure and not a partial volume model. That said, we think this general approach can give some additional insight, and we provide results of a similar analysis here. We calculated the weighted sum of R1 and R2* values over all V2 voxels for each stripe compartment (thin, thick and pale stripes) independently (see figure above). For R1, we see the same pattern between stripe types as in the manuscript (Figure 5). Additionally, we show the differences here for each subject, which further demonstrates the reproducibility across subjects in our study. For R2*, no clear pattern emerged across subjects, confirming the results in our manuscript. Since this analysis did not add relevant new information, we refrained from adding this figure to the manuscript so as not to overload it.
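      The weighting scheme described in the figure legend can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function names and toy values are hypothetical, "weights with z < 0 were set to 0" is read here as clipping negative weights to 0, and the weighted sum is normalised by the total weight so that it reads as a weighted mean.

```python
# Minimal sketch of the z-score weighting in the figure legend (not the
# authors' code). Assumptions: per-voxel inputs are plain lists, negative
# weights are clipped to 0, and the weighted sum is normalised by the total
# weight so it reads as a weighted mean. Function names are hypothetical.

Z_MAX = 1.96  # intermediate threshold quoted in the legend


def stripe_weights(z):
    """Thin- or thick-stripe weights: z - 1.96, clipped at 0."""
    return [max(zi - Z_MAX, 0.0) for zi in z]


def pale_weights(z_thin, z_thick):
    """Pale-stripe weights from the larger of the two z-scores per voxel.

    Voxels with max(z) >= 1.96 get weight 0; the remaining voxels get
    -(max(z) - 1.96), so the weight grows the further a voxel falls
    below threshold in both experiments.
    """
    weights = []
    for zt, zk in zip(z_thin, z_thick):
        z = max(zt, zk)
        weights.append(0.0 if z >= Z_MAX else -(z - Z_MAX))
    return weights


def weighted_mean(values, weights):
    """Weighted mean of per-voxel values, e.g. de-meaned R1."""
    total = sum(weights)
    if total == 0:
        raise ValueError("all weights are zero")
    return sum(v * w for v, w in zip(values, weights)) / total


if __name__ == "__main__":
    # Toy data for three voxels; values are made up for illustration.
    z_thin = [3.0, 1.0, -0.5]
    z_thick = [0.5, 2.5, 0.0]
    r1 = [0.1, -0.2, 0.3]
    print("thin:", weighted_mean(r1, stripe_weights(z_thin)))
    print("thick:", weighted_mean(r1, stripe_weights(z_thick)))
    print("pale:", weighted_mean(r1, pale_weights(z_thin, z_thick)))
```

      Each toy voxel is dominated by one compartment, so each weighted mean simply recovers that voxel's value; with real data, voxels contribute in proportion to how far their z-scores lie above (thin, thick) or below (pale) the threshold.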

      (2) In our current study, we were not primarily interested in investigating differences between thin and thick stripes. While histological analyses found differences (though not consistent ones) between CO dark stripes (more myelinated according to Tootell et al., 1983) and CO pale stripes (more myelinated according to Krubitzer and Kaas, 1989), no study has reported myelin differences between the two types of CO dark stripes. This does not fully exclude the possibility of such differences, but it suggests that if myelination differences between CO dark stripes exist, they are presumably smaller than those between CO dark and CO pale stripes. They would therefore be even more difficult to demonstrate than the hypothesis tested in this manuscript.

      We therefore decided to test two compartments directly against each other instead of modeling all three compartments within a single model. In doing so, we loosely followed the analysis methods described in Li et al. (2019), which compared myelin differences between thin/thick and pale stripes in macaques. We note that this provides a further consistency check, since it is not trivial that both thick and thin stripes show lower R1 values than pale stripes; there could equally have been no difference or an opposite one.

      (3) Just for clarification, the plots in Figure 5 show the comparison of R1 (or R2*) between two compartments in V2. The red (blue) curve includes the thin (thick) stripe of interest. The gray curve includes everything in V2 minus contributions from thick (thin) stripes of interest. If we take the thin stripe comparison as example (Figure 5a), then red contains the thin stripes of interest while gray contains everything minus the thick stripes. Therefore, assuming a tripartite stripe arrangement, the gray curve contains both thin and pale stripe contributions.

      References

      Carey D, Caprini F, Allen M, Lutti A, Weiskopf N, Rees G, Callaghan MF, Dick F. Quantitative MRI provides markers of intra-, inter-regional, and age-related differences in young adult cortical microstructure. Neuroimage 2018; 182:429–440.

      Dumoulin SO, Harvey BM, Fracasso A, Zuiderbaan W, Luijten PR, Wandell BA, Petridou N. In vivo evidence of functional and anatomical stripe-based subdivisions in human V2 and V3. Sci Rep 2017; 7:733.

      Freund P, Seif M, Weiskopf N, Friston K, Fehlings MG, Thompson AJ, Curt A. MRI in traumatic spinal cord injury: from clinical assessment to neuroimaging biomarkers. Lancet Neurol 2019; 18:1123–1135.

      González de San Román E, Bidmon H-J, Malisic M, Susnea I, Küppers A, Hübbers R, Wree A, Nischwitz V, Amunts K, Huesgen PF. Molecular composition of the human primary visual cortex profiled by multimodal mass spectrometry imaging. Brain Struct Funct 2018; 223:2767–2783.

      Kirilina E, Helbling S, Morawski M, Pine K, Reimann K, Jankuhn S, Dinse J, Deistung A, Reichenbach JR, Trampel R, Geyer S, Müller L, Jakubowski N, Arendt T, Bazin P-L, Weiskopf N. Superficial white matter imaging: Contrast mechanisms and whole-brain in vivo mapping. Sci Adv 2020; 6:eaaz9281.

      Krubitzer LA, Kaas JH. Cortical integration of parallel pathways in the visual system of primates. Brain Res 1989; 478:161–165.

      Lazari A, Lipp I. Can MRI measure myelin? Systematic review, qualitative assessment, and meta-analysis of studies validating microstructural imaging with myelin histology. Neuroimage 2021; 230:117744.

      Lazari A, Salvan P, Cottaar M, Papp D, Rushworth MFS, Johansen-Berg H. Hebbian activity-dependent plasticity in white matter. Cell Rep 2022; 39:110951.

      Li X, Zhu Q, Janssens T, Arsenault JT, Vanduffel W. In Vivo Identification of Thick, Thin, and Pale Stripes of Macaque Area V2 Using Submillimeter Resolution (f)MRI at 3 T. Cereb Cortex 2019; 29:544–560.

      Mancini M, Karakuzu A, Cohen-Adad J, Cercignani M, Nichols TE, Stikov N. An interactive meta-analysis of MRI biomarkers of myelin. Elife 2020; 9:e61523.

      Natu VS, Gomez J, Barnett M, Jeska B, Kirilina E, Jaeger C, Zhen Z, Cox S, Weiner KS, Weiskopf N, Grill-Spector K. Apparent thinning of human visual cortex during childhood is associated with myelination. PNAS 2019; 116:20750–20759.

      Tootell RBH, Silverman MS, De Valois RL, Jacobs GH. Functional Organization of the Second Cortical Visual Area in Primates. Science 1983; 220:737–739.

      Weiskopf N, Edwards LJ, Helms G, Mohammadi S, Kirilina E. Quantitative magnetic resonance imaging of brain anatomy and in vivo histology. Nat Rev Phys 2021; 3:570–588.

      Whitaker KJ, Vértes PE, Romero-Garcia R, Váša F, Moutoussis M, Prabhu G, Weiskopf N, Callaghan MF, Wagstyl K, Rittman T, Tait R, Ooi C, Suckling J, Inkster B, Fonagy P, Dolan RJ, Jones PB, Goodyer IM, NSPN Consortium, Bullmore ET. Adolescence is associated with genomically patterned consolidation of the hubs of the human brain connectome. PNAS 2016; 113:9105–9110.

    1. Author Response

      Reviewer #1 (Public Review):

      Determination of the biomechanical forces and downstream pathways that direct heart valve morphogenesis is an important area of research. In the current study, potential functions of localized Yap signaling in cardiac valve morphogenesis were examined. Extensive immunostainings were performed for Yap expression, but Yap activation status as indicated by nuclear versus cytoplasmic localization, Yap dephosphorylation, or expression of downstream target genes was not examined.

      We thank the reviewer for appreciating the significance of this work and for the constructive suggestions. Following these suggestions, we have improved our analysis of YAP activation status, using nuclear versus cytoplasmic localization to quantify YAP activation. To address the reviewer's concerns, we have also conducted additional qPCR analysis of YAP downstream target genes and YAP upstream genes in the Hippo pathway. Please find the detailed revisions in our responses to the Recommendations for authors.

      The goal of the work was to determine Yap activation status relative to different mechanical environments, but no biomechanical data on developing heart valves were provided in the study.

      We appreciate the reviewer for raising this concern. We have previously published the biomechanical data of developing chick embryonic heart valves in the following study:

      Buskohl PR, Gould RA, Butcher JT. Quantification of embryonic atrioventricular valve biomechanics during morphogenesis. Journal of Biomechanics. 2012;45(5):895-902.

      In that study, we used micropipette aspiration to measure the nonlinear biomechanics (strain energy) of chick embryonic heart valves at different developmental stages. In the present study, we used the same method to measure the strain energy of YAP-activated and YAP-inhibited cushion explants and compared it with the data from our previous study. Our findings are summarized in the Results section “YAP inhibition elevated valve stiffness”, and the detailed measurements, including images and data, are presented in Figure S4.

      There are several major weaknesses that diminish enthusiasm for the study.

      1) The Hippo/Yap pathway activation leads to dephosphorylation of Yap, nuclear localization, and induced expression of downstream target genes. However, there are no data included in the study on Yap nuclear/cytoplasmic ratios, phosphorylation status, or activation of other Hippo pathway mediators. Analysis of Yap expression alone is insufficient to determine activation status since it is widely expressed in multiple cells throughout the valves. The specificity for activated Yap signaling is not apparent from the immunostainings.

      We thank the reviewer for pointing out this weakness. As recommended, we now quantify YAP activation using nuclear versus cytoplasmic localization. We have also conducted additional experiments using qPCR to analyze YAP downstream target genes and upstream genes of the Hippo pathway. Please see the detailed revisions in our responses to the Recommendations for authors.

      2) The specific regionalized biomechanical forces acting on different regions of the valves were not measured directly or clearly compared with Yap activation status. In some cases, it seems that Yap is not present in the nuclei of endothelial cells surrounding the valve leaflets that are subject to different flow forces (Fig 1B) and the main expression is in valve interstitial subpopulations. Thus the data presented do not support differential Yap activation in endothelial cells subject to different fluid forces. There is extensive discussion of different forces acting on the valve leaflets, but the relationship to Yap signaling is not entirely clear.

      We thank the reviewer for these important questions. The region-specific biomechanics of developing valves have been well mapped and studied using computational fluid dynamics (CFD) supported by ultrasound velocity and pressure measurements. For example:

      Yalcin, H.C., Shekhar, A., McQuinn, T.C. and Butcher, J.T. (2011), Hemodynamic patterning of the avian atrioventricular valve. Dev. Dyn., 240: 23-35.

      Bharadwaj KN, Spitz C, Shekhar A, Yalcin HC, Butcher JT. Computational fluid dynamics of developing avian outflow tract heart valves. Ann Biomed Eng. 2012 Oct;40(10):2212-27. doi: 10.1007/s10439-012-0574-8.

      Ayoub S, Ferrari G, Gorman RC, Gorman JH, Schoen FJ, Sacks MS. Heart Valve Biomechanics and Underlying Mechanobiology. Compr Physiol. 2016 Sep 15;6(4):1743-1780.

      Salman HE, Alser M, Shekhar A, Gould RA, Benslimane FM, Butcher JT, et al. Effect of left atrial ligation-driven altered inflow hemodynamics on embryonic heart development: clues for prenatal progression of hypoplastic left heart syndrome. Biomechanics and Modeling in Mechanobiology. 2021;20(2):733-50.

      Ho S, Chan WX, Yap CH. Fluid mechanics of the left atrial ligation chick embryonic model of hypoplastic left heart syndrome. Biomechanics and Modeling in Mechanobiology. 2021;20(4):1337-51.

      Those studies have shown that unidirectional shear stress (USS) develops on the inflow surface of the valves and oscillatory shear stress (OSS) on the outflow surface, while compressive stress (CS) develops in the tip region of the valves and tensile stress (TS) in the regions of elongation and compaction. In this study, we mimic those forces in our in-vitro and ex-vivo models, which allows us to study the direct effect of each specific force on YAP activity in different cell lineages. The results showed that OSS promoted YAP activation in VECs while USS inhibited it, and CS promoted YAP activation in VICs while TS inhibited it. These results explain the spatiotemporal distribution of YAP activation in Figure 1. For example, nuclear YAP was mostly found in VECs on the fibrosa side, where OSS develops, and was absent from the nuclei of VECs on the atrialis/ventricularis side, where USS develops. It is also worth noting that OSS forms more slowly on the outflow side, so side-specific YAP activation in VECs was not yet evident at the early stages, from E11.5 to E14.5.

      3) The requirement for Yap signaling in heart valve remodeling as described in the title was not demonstrated through manipulation of Yap activity.

      With respect, it is unclear what the reviewer is asking for, given that no experiments are suggested and no alternative interpretation of our results arguing against a requirement for YAP is elaborated. It has previously been shown, using conditional YAP deletion in mice, that YAP signaling is required for the early EMT stages of valvulogenesis:

      Zhang H, von Gise A, Liu Q, Hu T, Tian X, He L, et al. Yap1 Is Required for Endothelial to Mesenchymal Transition of the Atrioventricular Cushion. Journal of Biological Chemistry. 2014;289(27):18681-92.

      The signaling roles of early regulators at these later fetal stages differ from, and are sometimes opposite to, their roles during early EndMT, which argues against relying on these early data to explain later events:

      Bassen D, Wang M, Pham D, Sun S, Rao R, Singh R, et al. Hydrostatic mechanical stress regulates growth and maturation of the atrioventricular valve. Development. 2021;148(13).

      However, embryos with YAP deletion failed to form endocardial cushions and did not survive long enough to study the roles of YAP in later cushion growth and remodeling into valve leaflets. In this work, we first showed the localization of YAP activity and its direct link with local shear or pressure domains. We then applied controlled gain and loss of function of YAP using specific molecules. We also performed critical mechanical gain- and loss-of-function studies to demonstrate that YAP mechanoactivation is necessary and sufficient to achieve growth and remodeling.

      Reviewer #2 (Public Review)

      This study by Wang et al. examines changes in YAP expression in embryonic avian cultured explants in response to high and low shear stress, as well as tensile and compressive stress. The authors show that YAP expression is increased in response to low, oscillatory shear stress, as well as high compressive stress conditions. Inhibition of YAP signaling prevents compressive stress-induced increases in circularity, decreased pHH3 expression, and increases VE-cadherin expression. On the other hand, YAP gain of function prevents tensile stress-induced decreases in pHH3 expression and VE-cadherin expansion. It also decreases the strain energy density of embryonic avian cushion explants. Finally, using an avian model of left atrial ligation, the authors demonstrate that unloaded regions within the primitive valve structures are associated with increased YAP expression, compared to regions of restricted flow where YAP expression is low. Overall, this study sheds light on the biomechanical regulation of YAP expression in developing valves.

      We thank the reviewer for the accurate summary and their enthusiasm for this work.

      Strengths of the manuscript include:

      • Novel insights into the dynamic expression pattern of YAP in valve cell populations during post-EMT stages of embryonic valvulogenesis.

      • Identify the positive regulation of YAP expression in response to low, oscillatory shear stress, as well as high compressive stress conditions.

      • Identify a link between YAP signaling in regulating stress-induced cell proliferation and valve morphogenesis.

      • The inclusion of the left atrial ligation model is innovative, and the data showing distinguishable YAP expression levels between restricted and non-restricted flow regions are insightful.

      We thank the reviewer for appreciating the strengths of this work.

      This is a descriptive study that focuses on changes in YAP expression following exposure to diverse stress conditions in embryonic avian cushion explants. Overall, the study currently lacks mechanistic insights, and conclusions based on data are highly over-interpreted, particularly given that the majority of experimental protocols rely on one method of readout.

      We thank the reviewer for constructive suggestions.

      Reviewer #3 (Public Review)

      In this manuscript, Wang et al. assess the role of wall shear stress and hydrostatic pressure during valve morphogenesis at stages where the valve elongates and takes shape. The authors elegantly demonstrate that shear and pressure have different effects on cell proliferation by modulating YAP signaling. The authors use a combination of in vitro and in vivo approaches to show that YAP signaling is activated by hydrostatic pressure changes and inhibited by wall shear stress.

      We thank the reviewer for their enthusiasm for the impact of our work.

      There are a few elements that would require clarification:

      1) The impact of YAP on valve stiffness was unclear to me. How is YAP signaling affecting stiffness? Is it through cell proliferation changes? I was unclear about the model put forward:

      • Is it cell proliferation (cell proliferation fluidizes tissue, while non-proliferating tissue is stiffer)?

      • Is it through differential gene expression?

      This needs clarification.

      We thank the reviewer for raising this important question. Cell proliferation can affect valve stiffness but is a minor factor compared with ECM deposition and cell contractility. Our micropipette aspiration data showed that the higher cell proliferation rate induced by YAP activation did lead to stiffer valves compared with controls. This may be because, at early stages, cells are more elastic than the viscous ECM. However, the stiffness of YAP-activated valves was only about half that of YAP-inhibited valves, showing that transcriptional-level factors play a more important role. This also suggests that YAP-inhibited valves exhibited a more mature phenotype. An analogous role of YAP has been found in cardiomyocytes, where it has been proposed that YAP activation turns on proliferation programs, whereas YAP inhibition turns them off and releases maturation programs. Similarly, we hypothesize that YAP acts as a mechanobiological switch, converting mechanical signals into a decision between growth and maturation. We have revised the Discussion to include this hypothesis.

      2) The model proposes an early asymmetric growth of the cushion leading to different shear forces (oscillatory vs unidirectional shear stress). What triggers the initial asymmetry of the cushion shape? Is YAP involved?

      Although the initial geometry of the cushion model is symmetric, the force acting on it is asymmetric. The detailed numerical simulation of how the initial forces trigger the asymmetric morphogenesis can be found in our previous publication:

      Buskohl PR, Jenkins JT, Butcher JT. Computational simulation of hemodynamic-driven growth and remodeling of embryonic atrioventricular valves. Biomechanics and Modeling in Mechanobiology. 2012;11(8):1205-17.

      The color maps represent the dilatation rates when a) only pressure is applied, b) only shear stress is applied, and c) both pressure and shear stress are applied. It is this loading that initiates an asymmetric morphological change, as shown in d). In addition, we believe YAP is involved during initiation because it is directly activated (nuclear localization) by CS and OSS, and retained in the cytoplasm (inactivated) by TS and USS.

      3) The differential expression of YAP and its correlation to cell proliferation is a little hard to see in the data presented. Drawings highlighting the main areas would help the reader to visualise the results better.

      We thank the reviewer for this helpful suggestion. We have improved the visualization of Figure 3C and Figure 4C with higher-magnification insets.

      4) The origin of osmotic/hydrostatic pressure in vivo. While shear is clearly dependent upon blood flow, it is less clear that hydrostatic pressure is solely dependent upon blood flow. For example, it has been proposed that ECM accumulation such as hyaluronic acid could modify osmotic pressure (see for example Vignes et al., PMID: 35245444). Could the authors clarify the following questions:

      • How blood flow affects osmotic pressure in vivo?

      • Is ECM a factor that could affect osmotic pressure in this system?

      We thank the reviewer for sharing this interesting study. Osmotic pressure plays a critical role in mechanotransduction and in the development of many tissues, including cardiovascular tissues and cartilage. As proposed in the reference, osmotic pressure is an interstitial force generated by cardiac contractility. The hydrostatic pressure in our study is different: it is an external force applied by the flowing blood. According to Bernoulli's law, when an incompressible fluid flows around a solid, the static pressure it applies on the solid is equal to its total pressure minus its dynamic pressure.
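      For reference, the Bernoulli relation invoked above can be written as below. This is the textbook form for steady, incompressible, inviscid flow along a streamline with the gravity term neglected; applying it to pulsatile blood flow over a valve is an idealization included here only for illustration, not a measurement from this study. Here p is the static pressure on the surface, p_0 the total (stagnation) pressure, \rho the blood density, and v the local flow velocity.

```latex
% Static pressure = total pressure - dynamic pressure
% (steady, incompressible, inviscid flow along a streamline; gravity neglected)
p = p_0 - \tfrac{1}{2}\rho v^{2}
```

      Under this idealization, at the same total pressure, regions where blood moves faster experience a lower static pressure on the valve surface.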

      Despite the difference, the osmotic pressure can mimic the effect of hydrostatic pressure in-vitro. The in-vitro osmotic pressure model has been widely used in cartilage research, for example:

      P. J. Basser, R. Schneiderman, R. A. Bank, E. Wachtel, and A. Maroudas, “Mechanical properties of the collagen network in human articular cartilage as measured by osmotic stress technique,” Arch. Biochem. Biophys., vol. 351, no. 2, pp. 207–219, 1998.

      D. A. Narmoneva, J. Y. Wang, and L. A. Setton, “Nonuniform swelling-induced residual strains in articular cartilage,” J. Biomech., vol. 32, no. 4, pp. 401–408, 1999.

      C. L. Jablonski, S. Ferguson, A. Pozzi, and A. L. Clark, “Integrin α1β1 participates in chondrocyte transduction of osmotic stress,” Biochem. Biophys. Res. Commun., vol. 445, no. 1, pp. 184–190, 2014.

      Z. I. Johnson, I. M. Shapiro, and M. V. Risbud, “Extracellular osmolarity regulates matrix homeostasis in the intervertebral disc and articular cartilage: Evolving role of TonEBP,” Matrix Biol., vol. 40, pp. 10–16, 2014.

      As maturing cushions shift from a GAG-dominated to a collagen-dominated ECM, the water- and ion-retention capacity of the tissue changes substantially, reducing the osmotic pressure. This could in turn accelerate cushion maturation. By contrast, the ECM of growing cushions remains GAG-dominated, which would delay maturation and prolong growth.

      The revised second section of Results is as follows:

      Shear and hydrostatic stress regulate YAP activity

      In addition to serving as a co-effector of the Hippo pathway, YAP is a key mediator of mechanotransduction. Indeed, the spatiotemporal activation of YAP correlated with changes in the mechanical environment. During valve remodeling, unidirectional shear stress (USS) develops on the inflow surface of the valves, where YAP is rarely found in the nuclei of VECs (Figure 2A). On the opposite side, oscillatory shear stress (OSS) develops on the outflow surface, where VECs show nuclear YAP localization. YAP activation in VICs also correlated with hydrostatic pressure. The pressure generated compressive stress (CS) in the valve tips, where VICs showed nuclear YAP localization (Figure 2B), whereas tensile stress (TS) was created in the elongated regions, where YAP was absent from VIC nuclei.

      To study the effect of shear stress on YAP activity in VECs, we applied USS and OSS directly to a monolayer of freshly isolated VECs. VECs were obtained from AV cushions of HH25 chick embryonic hearts. The cushions were placed on collagen gels with the endocardium adherent to the collagen and incubated to allow the VECs to migrate onto the gel. We then removed the cushions and immediately applied shear flow to the monolayer for 24 hours. Low-stress OSS (2 dyn/cm²) promoted YAP nuclear translocation in VECs (Figure 2C, E), whereas high-stress USS (20 dyn/cm²) restrained YAP in the cytoplasm.

      To study the effect of hydrostatic stress on YAP activation in VICs, we used media with different osmolarities to mimic CS and TS. CS was induced by hypertonic conditions and TS by hypotonic conditions, and the Unloaded (U) condition refers to osmotically balanced media. Notably, in-vivo hydrostatic pressure is generated by flowing blood, whereas in-vivo osmotic pressure is generated by cardiac contractility and plays a critical role in mechanotransduction during valve development (30). Despite their different in-vivo origins, osmotic pressure provides a reliable model to mimic hydrostatic pressure in-vitro (31). We cultured HH34 AV cushion explants under different loading conditions for 24 hours and found that the trapezoidal cushions adopted a spherical shape (Figure 2D). TS-loaded cushions were significantly compacted, and YAP activation in VICs of TS-loaded cushions was significantly lower than in CS-loaded VICs (Figure 2F).

    1. Author Response

      Reviewer #1 (Public Review):

      Huang et al. sought to study the cellular origin of Tuft cells and the molecular mechanisms that govern their specification in severe lung injury. First the authors show ectopic emergence of Tuft cells in airways and distal parenchyma following different injuries. The authors also used lineage tracing models and uncovered that p63-expressing cells and to some extent Scgb1a1-lineage labeled cells contribute to tuft cells after injury. Further, the authors modulated multiple pathways and claim that Notch inhibition blocks tuft cells whereas Wnt inhibition enhances Tuft cell development in basal cell cultures. Finally, the authors used Trpm5 and Pou2f3 knock-out models to claim that tuft cells are indispensable for alveolar regeneration.

      In summary, the findings described in this manuscript are somewhat preliminary. The claim that the cellular origin of Tuft cells in influenza infection was not determined is incorrect. Current data from pathway modulation is preliminary and this requires genetic modulation to support their claims.

      We thank the reviewer for the comments and we have performed extensive experiments to address the reviewer’s comments. In the revised manuscript we provide additional data including genetic modulation findings to support our model.

      Major comments:

      1) The abstract sounds incomplete and does not cover all key aspects of this manuscript. Currently, it is mainly focusing on the cellular origin of Tuft cells and the role of Wnt and notch signaling. However, it completely omits the findings from Trpm5 and Pou2f3 knock-out mice. In fact, the title of the manuscript highlights the indispensable nature of tuft cells in alveolar regeneration.

      We have modified the abstract and title accordingly.

      2) In lines 93-94, the authors state that "It is also unknown what cells generate these tuft cells.....". This statement is incorrect. Rane et al., 2019 used the same p63-creER mouse line and demonstrated that all tuft cells that ectopically emerge following H1N1 infection originate from p63+ lineage labeled basal cells. Therefore, this claim is not new.

      We thank the reviewer for this comment. Although Rane et al. reported that p63-expressing lineage-negative epithelial stem/progenitor cells (LNEPs) could contribute to the ectopic tuft cells after PR8 virus infection, it is still not clear whether p63+ cells give rise to tuft cells directly or through EBCs. Thus, unlike Rane et al. (Rane et al., 2019), who performed tamoxifen (TMX) injection before viral infection, we performed TMX injection after PR8 infection, which indicates that the ectopic tuft cells are derived from EBCs, as shown in revised Figure 2.

      3) Lines 152-153 state that "21.0% +/- 2.0 % tuft cells within EBCs are labeled with tdT when examined at 30 dpi...". It is not clear what the authors meant here ("within EBC's")? And also, the same sentence states that "......suggesting that club cell-derived EBCs generate a portion of tuft cells....". In this experiment, the authors used club cell lineage tracing mouse lines. So, how do the authors know that the club cell lineage-derived tuft cells came through intermediate EBC population? Current data do not show evidence for this claim. Is it possible that club cells can directly generate tuft cells?

      We apologize for the confusion and have revised the text accordingly. Here, “within EBCs” means within the “pod” areas where p63+ basal cells are ectopically present. The sentence has been revised to: “21.0% +/- 2.0 % of tuft cells that are ectopically present in the parenchyma are labeled by tdT. Notably, these lineage-labeled tuft cells were co-localized with EBCs.” We do not know whether the club cell lineage-derived tuft cells transit through intermediate EBCs, which is why we used “suggest”. It is also possible that club cells directly generate tuft cells. To avoid confusion, we have deleted the sentence.

      4) Based on the data from Fig-3A, the authors claim that treatment with C59 significantly enhances tuft cell development in ALI cultures. Porcupine is known to facilitate Wnt secretion. So, which cells are producing Wnt in these cultures? It is important to determine which cells are producing Wnt and also which Wnt? Further, based on DBZ treatments, it appears that active Notch signaling is necessary to induce Tuft cell fate in basal cells. Where are Notch ligands expressed in these tissues? Is Notch active only in a small subset of basal cells (and hence generate rare tuft cells)? This is one of the key findings in this manuscript. Therefore, it is important to determine the expression pattern of Wnt and Notch pathway components.

      We thank the reviewer for these interesting questions and agree on the importance of identifying the specific ligands and receptors for the relevant Wnt and Notch signaling during tuft cell derivation. That said, we think the topic is beyond the scope of this study, which is focused on the role of tuft cells in alveolar regeneration. The point is well taken, and we will investigate it in a future study.

      5) How do the authors explain different phenotypes observed in Trpm5 knockout and Pou2f3 mutants? Is it possible that Trpm5 knockout mice have a subset of tuft cells and that they might be something to do with the phenotypic discrepancy between two mutant models?

      Again, we thank the reviewer for the interesting question. As noted in the Discussion, Trpm5 is also reported to be expressed in B lymphocytes (Sakaguchi et al., 2020). It is possible that loss of Trpm5 modulates the inflammatory responses following viral infection, which may contribute to improved alveolar regeneration. However, it is also possible that Trpm5-/- mice retain a subset of tuft cells that facilitate lung regeneration, as suggested by the reviewer.

      6) One of the key findings in this manuscript is that Wnt and Notch signaling play a role in Tuft cell specification. All current experiments are based on pharmacological modulation. These need to be substantiated using genetic gain- and loss-of-function models.

      We have performed the genetic studies.

      Reviewer #2 (Public Review):

      In this manuscript, the authors describe the ectopic differentiation of tuft cells that were derived from lineage-tagged p63+ cells post influenza virus infection. These tuft cells do not appear to proliferate or give rise to other lineages. They then claim that Wnt inhibitors increase the number of tuft cells while inhibiting Notch signaling decreases the number of tuft cells within Krt5+ pods after infection in vitro and in vivo. The authors further show that genetic deletion of Trpm5 in p63+ cells post-infection results in an increase in AT2 and AT1 cells in p63 lineage-tagged cells compared to control. Lastly, they demonstrate that depletion of tuft cells caused by genetic deletion of Pou2f3 in p63+ cells has no effect on the expansion or resolution of Krt5+ pods after infection, implying that tuft cells play no functional role in this process.

      Overall, in vivo and in vitro phenotypes of tuft cells and alveolar cells are clear, but the lack of detailed cellular characterization and molecular mechanisms underlying the cellular events limits the value of this study.

      We thank the reviewer for the comments and acknowledging that our findings are clear. In the revised manuscript we provide more detailed characterization and genetic evidence to elucidate the role of tuft cells in lung regeneration.

      1) Origin of tuft cells: Although the authors showed the emergence of ectopic tuft cells derived from labelled p63+ cells after infection, it cannot be ruled out that pre-existing p63+Krt5- intrapulmonary progenitors, as previously reported, can also contribute to tuft cell expansion (Rane et al. 2019; by labelling p63+ cells prior to infection, they showed that the majority of ectopic tuft cells are derived from p63+ cells after viral infection). It would be more informative if the authors show the differentiation of tuft cells derived from p63+Krt5+ cells by tracing Krt5+ cells after infection, which will tell us whether ectopic tuft cells are differentiated from ectopic basal cells within Krt5+ pods induced by virus infection.

      We thank the reviewer for the helpful suggestion. We have performed the experiment accordingly.

      2) Mechanisms of tuft cell differentiation: The authors tried to determine which signaling pathways regulate the differentiation of tuft cells from p63+ cells following infection. Although Wnt/Notch inhibitors affected the number of tuft cells derived from p63+ labelled cells, it remains unclear whether these signals directly modulate differentiation fate. The authors claimed that Wnt inhibition promotes tuft cell differentiation from ectopic basal cells. However, in Fig 3B, Wnt inhibition appears to trigger the expansion of p63+Krt5+ pod cells, resulting in increased tuft cell differentiation rather than directly enhancing tuft cell differentiation. Further, in Fig 3D, Notch inhibition appears to reduce p63+Krt5+ pod cells, resulting in decreased tuft cell differentiation. Importantly, previous studies have reported that Notch signalling is critical for Krt5+ pod expansion following influenza infection (Vaughan et al. 2015; Xi et al. 2017). Notch inhibition reduced Krt5+ pod expansion and induced their differentiation into Sftpc+ AT2 cells. In order to address the direct effect of Wnt/Notch signaling in the differentiation process of tuft cells from EBCs, the authors should provide a more detailed characterization of cellular composition (Krt5+ basal cells, club cells, ciliated cells, AT2 and AT1 cells, etc.) and activity (proliferation) within the pods with/without inhibitors/activators.

      Again we thank the reviewer for the insightful suggestions. We agree that it will be interesting to further address the direct effect of Wnt/Notch signaling in the differentiation process of tuft cells from EBCs. In this revised manuscript we added new findings of EBC differentiation into tuft cells in mice with genetic deletion of Rbpjk.

      3) Impact of Trpm5 deletion in p63+ cells: It is interesting that Trpm5 deletion promotes the expansion of AT2 and AT1 cells derived from labelled p63+ cells following infection. It would be informative to check whether Trpm5 regulates Hif1a and/or Notch activity which has been reported to induce AT2 differentiation from ectopic basal cells (Xi et al. 2017). Although the authors stated that there was no discernible reduction in the size of Krt5+ pods in mutant mice, it would be interesting to investigate the relationship between AT2/AT1 cell retaining pods and the severity of injury (e.g. large Krt5+ pods retain more/less AT2/AT1 cells compared to small pods). What about other cell types, such as club and goblet cells, in Trpm5 mutant pods? Again, it cannot be ruled out that pre-existing p63+Krt5- intrapulmonary progenitor cells can directly convert into AT2/AT1 cells upon Trpm5 deletion rather than p63+Krt5+ cells induced by infection.

      We thank the reviewer for the comments and suggestions. Our new data using the KRT5-CreER mouse line confirmed that pod cells (Krt5+) do not contribute to AT2/AT1 cells, consistent with previous studies (Kanegai et al., 2016; Vaughan et al., 2015). Our data also show that p63-CreER lineage-labeled AT2/AT1 cells are spatially separate from the pod area, suggesting that pod cells and these AT2/AT1 cells arise from different cells of origin. We also examined Notch activity in pod cells in Trpm5-/- mice: some pod cell-derived cells are Hes1 positive, whereas others are Hes1 negative (RLFigure 1). As indicated in the Discussion, we think that these AT2/AT1 cells are possibly derived from pre-existing AT2 cells that transiently express p63 after PR8 infection. It will be interesting to test whether Trpm5 regulates Hif1a in this population (p63+Krt5-), and this will be our next plan.

      RLFigure 1. Representative area staining in Trpm5-/- mice at 30 dpi. Area 1: Notch signaling is active (Hes1+, arrows) in pod cells following viral infection. Area 2: pod cells exhibit reduced Notch activities. Note few Hes1+ cells in pods (arrows). Scale bar: 50 µm.

      4) Ectopic tuft cells in COVID-19 lungs: The previous study by the authors' group revealed the presence of ectopic tuft cells in COVID-19 patient samples (Melms et al. 2021). There appears to be no additional information in this manuscript.

      In Melms et al., Nature, 2021 (Melms et al., 2021), we showed tuft cell expansion in COVID-19 lungs but not the potential origin of tuft cells. In this manuscript we show some cells co-expressing POU2F3 and KRT5, suggesting a pod-to-tuft cell differentiation.

      5) Quantification information and method: Overall, the quantification method should be clarified throughout the manuscript. Further, in the method section, the authors stated that the production of various airway epithelial cell types was counted and quantified on at least 5 "random" fields of view. However, virus infection causes spatially heterogeneous injury, resulting in a difficult to measure "blind test". The authors should address how they dealt with this issue.

      We have clarified the quantification method as suggested. For the in vitro cell culture assays on the signaling pathways, we took pictures from at least five random fields of view for quantification. For lung sections, we tile-scanned sections including at least three lung lobes and performed quantification on the tile scans.

      Reviewer #3 (Public Review):

      In this manuscript Huang et al. study how the lung regenerates after severe injury due to viral infection. They focus on how tuft cells may affect regeneration of the lung by ectopic basal cells and come to the conclusion that they are not required. The manuscript is intriguing but also very puzzling. The authors claim they are specifically targeting ectopic basal progenitor cells and show that they can regenerate the alveolar epithelium in the lung following severe injury. However, it is not clear that the p63-CreERT2 line the authors are using only labels ectopic basal cells. The question is what is a basal cell? Is an ectopic basal progenitor cell only defined by Trp63 expression?

      The accompanying manuscript by Barr et al. uses a Krt5-CreERT2 line to target ectopic basal cells and using that tool the authors do not see a significant contribution of ectopic basal cells towards alveolar epithelial regeneration. As such the claim that ectopic basal cell progenitors drive alveolar epithelial regeneration is not well-founded.

      We thank the reviewer for the positive comments and for agreeing that our findings are interesting.

      The title itself is also not very informative and is a bit misleading. That being said I think the manuscript is still very interesting and can likely easily be improved through a better validation of which cells the p63-CreERT2 tool is targeting.

      We have revised the title accordingly and performed extensive experiments to address the reviewer’s concerns.

      I, therefore, suggest the following experiments.

      1) Please analyze which cells p63-CreERT2 labels immediately after PR8 and tamoxifen treatment. Are all the tdTomato labeled cells also Krt5 and p63 positive or are some alveolar epithelial cells or other airway cell types also labeled?

      We thank the reviewer for the question. To address it, we performed PR8 infection (250 pfu) on three Trp63-CreERT2;R26tdT mice followed by TMX treatment at days 5 and 7 post infection. We did not administer TMX immediately after infection because the mice become sick within a few days of infection. Lung samples were collected at 14 dpi. We observed tdT+ cells in the airways (rebuttal letter RLFigure 2A, B), and the lineage-labeled cells (tdT+) appear to include club cells (CC10+) that are underlain by tdT+Krt5+ basal cells (RLFigure 2C). We think that these labeled basal cells give rise to club cells. However, we also noticed that rare club cells and ciliated cells (FoxJ1+) are labeled by tdT in areas lacking surrounding tdT+ basal cells (RLFigure 2D). Moreover, a minor population of tdT+SPC+ cells is present in the terminal airways that were disrupted by viral infection (RLFigure 2E and D). We did not see any pods formed in this experiment, and we did not observe any tdT+ cells in the intact (uninjured) alveoli.

      RLFigure 2. Trp63-CreERT2 lineage labeled cells in the airways but not the alveoli when tamoxifen was administered at days 5 and 7 after PR8 H1N1 viral infection. Trp63-CreERT2;R26-tdT mice were infected with PR8 at 250 pfu, and Tmx was delivered at a dose of 0.25 mg/g body weight by oral gavage. Lung samples were collected and analyzed at 14 dpi. Antibodies used for staining are as indicated. Scale bar: 100 µm.

      2) Please also show if p63-CreERT2 labels any cells in the adult lung parenchyma in the absence of injury after tamoxifen treatment.

      Dr. Wellington Cardoso’s group demonstrated that Trp63-CreERT2 labels only very few cells in the airways and none in the lung parenchyma in the absence of injury after tamoxifen treatment (Yang et al., 2018). Dr. Ying Yang has revisited the data and did not observe any labeling in the lung parenchyma (n = 2).

      3) Please analyze if p63-CreERT2 labels any cells with tdTomato in the absence of injury or after PR8 infection but without tamoxifen treatment.

      We performed the experiment and didn't observe any labeled cells in the lung parenchyma without Tamoxifen treatment (n = 4).

      4) Please analyze when after PR8 infection do the first p63-CreERT2 labeled tdTomato positive alveolar epithelial cells appear.

      We administered tamoxifen at days 5 and 7 after PR8 infection and harvested lung tissues at day 14. As shown in Figure 1, we observed a few tdT+ SPC+ cells in the terminal airways that were disrupted by viral infection. Notably, we did not observe any lineage labeled cells in the intact (uninjured) alveoli in this experiment.

      5) A clonal analysis of p63-CreERT2 labeled cells using a confetti reporter might also help interpret the origin of p63-CreERT2 labeled cells.

      We thank the reviewer for the suggestion. Our new data demonstrate that a rare population of SPC+tdT+ cells is present in the disrupted terminal airways of Trp63-CreERT2;R26tdT mice. Our data in the original manuscript and the new data suggest that the initial SPC+;tdT+ cells are rare, because we had to administer multiple doses of tamoxifen to label them. Given the lower labeling efficiency of the Confetti reporter compared with R26tdT mice, it is possible that we would not be able to label these SPC+ cells. Moreover, our original manuscript clearly shows individual clones of SPC+tdT+ cells in the regenerated lung, and they do not appear to be composed of multiple clones. Therefore, we think that the use of Confetti mice may not add new information.

      6) Lastly, could the authors compare the single-cell RNA-seq transcriptional profiles of p63-CreERT2 labeled cells immediately after PR8 and tamoxifen treatment and also at 60 dpi? A pseudotime analysis and trajectory inference analysis could help elucidate the identity of p63-CreERT2 labeled cells that are actually not ectopic basal progenitor cells.

      We appreciate the reviewer’s suggestion and agree that single-cell RNA sequencing with pseudotime analysis could provide further information regarding the origin of the lineage labeled alveolar cells in Trp63-CreERT2;R26tdT mice. That said, our new data clearly show that KRT5-CreER lineage labeled cells do not give rise to AT1/2 cells as previously described (Kanegai et al., 2016; Vaughan et al., 2015), suggesting that the ectopic basal progenitor cells do not generate alveolar cells. By contrast, Trp63-CreERT2 lineage labeled cells do give rise to AECs, suggesting that this p63+ cell population capable of generating AECs is different from Krt5+ ectopic basal progenitor cells. Our single-cell core has an extremely long waiting list due to the pandemic, and we hope that our new findings sufficiently address the reviewer’s concern without the need for single-cell analysis.

    1. Author Response

      Reviewer #1 (Public Review):

      This manuscript applies the framework of information theory to study a subset of cellular receptors (called lectins) that bind to glycan molecules, with a specific focus on the kinds of glycans that are typical of fungal pathogens. The authors use the concentration of various types of ligands as the input to the signaling channel, and measure the "response" of individual cells using a GFP reporter whose expression is driven by a promoter that responds to NFκB. While this work is overall technically solid, I would suggest that readers keep several issues in mind while evaluating these results.

      1) One of the largest potential limitations of the study is the reliance of the authors on exogenous expression of the relevant receptors in U937 cells. Using a cell-line system like this has several advantages, most notably the fact that the authors can engineer different reporters and different combinations of receptors easily into the same cells. This would be much more difficult with, say, primary cells extracted from a mouse or a human. While the ability to introduce different proteins into the cells is a benefit, the problem is that it is not clear how physiologically relevant the results are. To their credit, the authors perform several controls that suggest that differences in transfection efficiency are not the source of the differences in channel capacity between, say, dectin-1 and dectin-2. As the authors themselves clearly demonstrate, however, the differences in the properties of these signaling systems are not based on receptor expression levels, but rather on some other property of the receptor. Now, it could be that the dectin-2 receptor is somehow just more "noisy" in terms of its activity compared to, say, dectin-1. This seems a somewhat less likely explanation, however, and so it is likely that downstream details of the signaling systems differ in some way between dectin-2 and the more "information efficient" receptors studied by the authors.

      The channel capacity of a cell signaling network depends critically on the distributions of the downstream signaling molecules in question: see the original paper by Cheong et al. (2011, Science 334 (6054), 354-8) and subsequent papers (notably Selimkhanov et al. (2014) Science 346 (6215), 1370-3 and Suderman et al. (2018) Interface Focus 8 (6), 20180039). The U937 cells considered here clearly do not serve the physiological function of detecting the glycans studied by the authors; beyond being an artificial cell line, the fact that the authors have to exogenously express the relevant receptors indicates that these cells are not necessarily a good model for the cell types in the body that have actually evolved to sense these glycan molecules.

      Signaling molecules readily exhibit cell-type-specific expression levels that influence cellular responses to external stimuli (Rowland et al. (2017) Nat Commun 8, 16009). It is therefore unclear whether the distributions of downstream signaling molecules in U937 cells mirror those that would be observed in the immune cell types relevant to this response. As such, the physiological relevance of the differences between the dectin-2 channel capacities and those exhibited by the other receptors is currently unclear.

      We appreciate Reviewer #1’s in-depth comments on the physiological relevance of the U937 cell line. A major benefit of using information theory to investigate a biological communication channel is that it allows quantitative measurement of the information that the channel transmits without detailed measurement of the spatiotemporal dynamics of receptors and downstream signaling cascades. In addition, comparing the measured quantities of information in turn gives us a reasonable prediction about detailed signaling mechanisms. For example, we investigated how transmission of glycan information from dectin-2 is synergistically modulated in the presence of either dectin-1, DC-SIGN or mincle. Our approach allows us to investigate how individual lectins on immune cells contribute to glycan information transmission and how they are integrated in the presence of other types of lectins. Therefore, the findings describe in a more defined way how physiologically relevant lectins integrate the extracellular signal. Furthermore, we found that our model cell line has one order of magnitude higher expression of dectin-2 than primary human monocytes and exhibits a similar zymosan binding pattern (described in our response to the Recommendations for the authors and in Figure R8).

      We fully agree that more information on the information transmission capability of primary immune cells would increase physiological relevance. In the revised manuscript, we addressed this concern by comparing the receptor expression levels of our model cell lines with those of primary monocytes and found agreement in cellular heterogeneity. However, we would also like to point out that the very basic nature of our question, namely how information stored in glycans is processed by lectins, is not tightly bound to these differences between primary cells and cell lines.

      Line 382: Finally, it is important to take into consideration that our conclusions came from model cell lines, which were used as a surrogate for the cell-type-specific lectin expression patterns of primary immune cells. Human monocytes and dectin-2 positive U937 cells have comparable receptor densities and respond similarly to stimulation with zymosan particles (SI Fig. 6A and B).

      2) Another issue that readers might want to keep in mind is that the details of the channel capacity calculation are a bit unclear as the manuscript is currently written. The authors indicate that their channel capacity calculations follow the approach of Cheong et al. (2011) Science 334 (6054), 354-8. However, the extent to which they follow that previous approach is not obvious. For instance, the calculations presented in the 2011 work use a combined bootstrapping/linear extrapolation approach to estimate the mutual information at infinite population size in order to deal with known inaccuracies in the calculation that arise from finite-size effects. The Cheong approach also deals with the question of how many bins to use in order to estimate the joint probability distribution across signal and response.

      They do this by comparing the mutual information they calculate for the real data with that calculated for random data to ensure that they are not calculating spuriously high mutual information based on having too many bins. While the Cheong et al. paper does a great job explaining why these steps need to be undertaken, a subsequent paper by Suderman et al. (2017, PNAS 114 (22), 5755-60) explains the approach in even greater detail in the supporting information. Those authors also implemented several improvements to the general approach, including a bootstrap method for more accurately estimating the error in the mutual information and channel capacity estimates.
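      The finite-size correction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedure of Cheong et al. or of the authors: the helper names plugin_mi and extrapolated_mi, the subsample fractions, the number of repeats and the simulated dose/reporter data are all hypothetical. The plug-in (histogram) estimate is averaged over random subsamples of size N, regressed linearly against 1/N, and the intercept is taken as the estimate at infinite sample size.

```python
import numpy as np


def plugin_mi(x, y, x_edges, y_edges):
    """Plug-in (histogram) estimate of I(X;Y) in bits."""
    joint, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges])
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)  # input marginal, shape (nx, 1)
    py = p.sum(axis=0, keepdims=True)  # output marginal, shape (1, ny)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / (px @ py)[nz])))


def extrapolated_mi(x, y, x_edges, y_edges,
                    fractions=(0.5, 0.6, 0.7, 0.8, 0.9, 1.0),
                    n_boot=20, seed=0):
    """Fit mean subsample MI against 1/N and return the intercept (N -> infinity).

    Hypothetical helper; fractions and n_boot are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    inv_sizes, mean_mis = [], []
    for f in fractions:
        m = int(f * n)
        vals = []
        for _ in range(n_boot):
            idx = rng.choice(n, size=m, replace=False)  # random subsample of size m
            vals.append(plugin_mi(x[idx], y[idx], x_edges, y_edges))
        inv_sizes.append(1.0 / m)
        mean_mis.append(np.mean(vals))
    slope, intercept = np.polyfit(inv_sizes, mean_mis, 1)  # MI ~ intercept + slope / N
    return float(intercept)


# Simulated data (assumption): 5 discrete doses, reporter = dose + Gaussian noise.
rng = np.random.default_rng(1)
x = rng.integers(0, 5, size=5000).astype(float)
y = x + rng.normal(0.0, 0.5, size=x.size)
x_edges = np.arange(-0.5, 5.5, 1.0)                 # one bin per dose level
y_edges = np.linspace(y.min(), y.max(), 21)         # 20 equal-width output bins
print(plugin_mi(x, y, x_edges, y_edges), extrapolated_mi(x, y, x_edges, y_edges))
```

      Because the plug-in estimator is biased upward by roughly a constant divided by N, the intercept of the linear fit removes most of that bias; the fit is only meaningful over the range of N where the bias is approximately linear in 1/N.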

      The problem here is that, while the authors claim to follow the approach of Cheong et al., it seems that they have re-implemented the calculation, and they do not provide sufficient detail to evaluate the extent to which they are performing the same exact calculation. Since estimates of mutual information are technically challenging, specific details of the steps in their approach would be helpful in order to understand how closely their results can be compared with the results of previous authors. For instance, Cheong et al. estimate the "channel capacity" by trying a set of likely unimodal and bimodal distributions for the input to the channel, and choosing the maximal value as the channel capacity. This is clearly a very approximate approach, since the channel capacity is defined as the supremum over an (uncountably infinite) set of input probability distributions. In any case, the authors of the current manuscript use a different approach to this maximization problem. Although it is a bit unclear how their approach works, it seems that they treat the probability of each input bin as an independent parameter (under the constraint that the probabilities sum to one) and then use an optimization algorithm implemented in Python to maximize the mutual information. In principle, this could be a better approach, since the set of input distributions considered is potentially much larger. The details of the optimization algorithm matter, however, and those are currently unclear as the paper is written.

      We thank Reviewer #1 for the recommendation to increase the rigor of the calculation. In the revised manuscript, we explain the channel capacity calculation procedures in more detail, including the statistical approaches adopted from Cheong et al. (2011) and Suderman et al. (2018) (SI sections 1 and 2). Furthermore, we chose the number of bins based not only on the random dataset but also on the total number of samples, as shown below:

      Figure R1. A) Extrapolated channel capacity values of a random dataset at infinite subsample size for various total numbers of samples and numbers of output bins. The white line in the heatmap marks a channel capacity of 0.01 bits. B) Extrapolated channel capacity values at infinite subsample size for the input (TNF-α doses) and output (GFP reporter) responses of U937 cells.

      Figure R1 shows channel capacity values for random (A) and experimental (B, TNFAR + TNF-α) datasets. The channel capacity values from random data indicate how the channel capacity depends on the number of output bins and the total number of samples. Based on this heatmap, we set the allowed bias to 0.01 bits, shown as the contour line in Figure R1A. Since the smallest dataset that we used for channel capacity calculation in the absence of labelled input contains nearly 90,000 samples, the expected bias in the channel capacity calculation is less than 0.01 bits for 10 to 1000 output bins, as shown in Figure R1A.
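      A simplified version of the random-data calibration can be sketched as below, assuming independently drawn inputs and outputs so that the true mutual information is zero. Unlike Figure R1A, this sketch reports the raw plug-in mutual information for a uniform input rather than an extrapolated channel capacity; the function names, the five input levels and the repeat count are hypothetical. It shows the spurious information that appears when the number of output bins is large relative to the number of samples, which is the bias that the extrapolation and the choice of bin number are meant to control.

```python
import numpy as np


def plugin_mi_bits(x, y, n_in, n_out_bins):
    """Plug-in I(X;Y) in bits; x holds integer levels 0..n_in-1, y gets equal-width bins."""
    y_edges = np.linspace(y.min(), y.max(), n_out_bins + 1)
    x_edges = np.arange(n_in + 1) - 0.5  # one bin per input level
    joint, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges])
    p = joint / joint.sum()
    outer = p.sum(axis=1, keepdims=True) @ p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / outer[nz])))


def spurious_mi(n_samples, n_out_bins, n_in=5, n_rep=5, seed=0):
    """Mean plug-in MI for independent input and output (true MI is exactly 0)."""
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_rep):
        x = rng.integers(0, n_in, size=n_samples)  # uniform input levels
        y = rng.normal(size=n_samples)             # output independent of x
        vals.append(plugin_mi_bits(x, y, n_in, n_out_bins))
    return float(np.mean(vals))


for n in (1_000, 10_000, 90_000):
    print(n, [round(spurious_mi(n, b), 4) for b in (10, 100, 1000)])
```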

      Furthermore, we performed the mutual information maximization using predefined unimodal and bimodal input distributions and compared the result with the systematic method used in this work. We found no noticeable difference in channel capacity between the two approaches (SI Figure 3M).

      3) Another issue to be careful about when interpreting these findings is the fact that the authors use logarithmic bins when calculating the channel capacity estimates. This is equivalent to saying that the "output" of the cell signaling channel is not the amount of protein produced under the control of the NFκB promoter, but rather the log of the protein level. Essentially, the authors are considering a case where the relevant output of the system is not the amount of protein itself, but the fold change in the amount of protein. That might be a reasonable assumption, especially if the protein being produced is a transcription factor whose own promoters have evolved to detect fold changes. For many proteins, however, the cell is likely responsive to linear changes in protein concentration, not fold changes. And so choosing the log of the protein level as the output may not make sense in terms of understanding how much information is actually contained in this particular output variable. Regardless, choosing logarithmic bins is not purely a matter of convenience or arbitrary choice, but rather corresponds to a very strong statement about what the relevant output of the channel is.

      We understand Reviewer #1’s concern regarding the choice of logarithmic binning. We found that when more than 200 bins are used, the estimated channel capacities converge to the same value regardless of the binning method (linear, logarithmic or equal frequency). The only difference is how quickly the values approach the converged channel capacity as the number of bins increases (Figure R2). In the revised manuscript, we used linear binning to better represent protein-level signaling, as the Reviewer suggested. Note that the channel capacity values calculated with linear binning do not differ noticeably from our previously calculated values.

      On the other hand, linear binning introduces significant bias when labelled input (i.e., continuous input) is included in the channel capacity calculation, because the number of bins in the input dimension increases.

      Figure R2. Dependence of the channel capacity on the number of output bins and on the binning method for the experimental dataset. The inset plots show the difference of each channel capacity value relative to the maximum channel capacity over the entire binning range (i.e., 10 to 1000 bins) for the corresponding binning method.

      Following Reviewer #1’s comment, we changed the binning method from logarithmic to linear binning for the whole experimental dataset, except in the presence of labelled input (i.e., dectin-2 antibody). When we consider the channel capacity between labelled input and the NF-κB reporter, equal frequency binning is used for every layer of the channel capacity (i.e., labelled input-binding, binding-GFP, labelled input-GFP).
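      The three binning schemes compared in Figure R2 can be sketched with NumPy as below. This is an illustrative sketch only: the helper names bin_edges and mi_bits, the simulated log-normal reporter and the bin counts are assumptions, and the plug-in estimator shown here does not include the finite-size extrapolation used for the reported values.

```python
import numpy as np


def bin_edges(values, n_bins, method):
    """Output bin edges for linear, logarithmic or equal-frequency binning (hypothetical helper)."""
    v = np.asarray(values, dtype=float)
    if method == "linear":
        return np.linspace(v.min(), v.max(), n_bins + 1)
    if method == "log":
        # Logarithmic bins need strictly positive values (e.g. reporter intensities > 0).
        return np.geomspace(v.min(), v.max(), n_bins + 1)
    if method == "equal_frequency":
        # Quantile edges place roughly the same number of cells in every bin.
        return np.quantile(v, np.linspace(0.0, 1.0, n_bins + 1))
    raise ValueError(f"unknown binning method: {method!r}")


def mi_bits(x_levels, y, y_edges):
    """Plug-in I(X;Y) in bits for integer input levels and binned output."""
    n_in = int(x_levels.max()) + 1
    joint, _, _ = np.histogram2d(x_levels, y, bins=[np.arange(n_in + 1) - 0.5, y_edges])
    p = joint / joint.sum()
    outer = p.sum(axis=1, keepdims=True) @ p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / outer[nz])))


# Simulated log-normal reporter whose median rises with dose level (assumption).
rng = np.random.default_rng(0)
dose = rng.integers(0, 6, size=20_000)
gfp = np.exp(0.6 * dose + rng.normal(0.0, 0.8, size=dose.size))
for method in ("linear", "log", "equal_frequency"):
    print(method, [round(mi_bits(dose, gfp, bin_edges(gfp, b, method)), 3) for b in (10, 200, 1000)])
```

      Equal-frequency edges place a similar number of cells in every bin regardless of the shape of the distribution, whereas linear bins on a skewed reporter leave most cells in the first few bins when the number of bins is small.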

      Reviewer #2 (Public Review):

      My expertise is more on the theoretical than the experimental aspects of this paper, so those will be the focus of these comments.

      Signal transduction is an important area of study for mathematical biologists and biophysicists. This setting is a natural one for information-theoretic methods, and such methods are attracting increasing research interest. Experimental results that attempt to directly quantify the Shannon capacity of signal transduction are particularly interesting. This paper represents an important contribution to this emerging field.

      My main comments are about the rigorousness and correctness of the theoretical results. More details about these results would improve the paper and help the reader understand the results.

      We understand Reviewer #2’s comment related to the rigor and correctness of the theoretical results of this work. In the revised manuscript, we added the following content to help readers better understand the channel capacity calculation procedures.

      • A general illustrative introduction to how we measured the input and output datasets and how we processed those data to prepare the joint probability distribution (SI sections 1.1 and 1.2).

      • An example of the mutual information maximization procedure using experimental and arbitrary datasets (SI section 1.3).

      The calculation of channel capacity, given in the methods, is quite a standard calculation and appears to be correct. However, I was confused by the use of the "weighting value" w_i, which is not specified in the manuscript. The input distribution appears to be a product of the weight w_i and the input probability value p_i, and these appear always to occur together as a product w_i p_i. (In joint probabilities w_i p(i,j), the input probability can be extracted using Bayes' rule, leaving w_i p_i p(j|i).) This leads me to wonder two things. First, what role does w_i play (is it even necessary)? Second, of particular interest here is the capacity-achieving input distribution p_i, but w_i obscures it; is the physical input distribution p_i equal to the capacity-achieving distribution? If not, what is the meaning of capacity?

      We thank Reviewer #2 for the comment regarding the arbitrariness of the weightings. We realize that the weighting values were insufficiently explained in the original manuscript. $P_X(i)$ is the marginal probability distribution of the input in the original dataset, and $P'_X(i)$ is the marginal probability distribution of the modified input that maximizes the mutual information. In general, $P_X(i)$ is not equal to $P'_X(i)$, so $P'_X(i)$ has to be found starting from $P_X(i)$. Because $P'_X(i)$ is obtained by rescaling $P_X(i)$ bin by bin, it can be written as $w(i)P_X(i)$, where $w(i)$ are the weightings, under the constraint $\sum_i w(i)P_X(i) = 1$. The modified input distribution in turn changes the joint probability distribution to $P'_{XY}(i,j) = w(i)P_{XY}(i,j)$. The capacity-achieving input distribution is therefore $w(i)P_X(i)$, not the physical input distribution $P_X(i)$. To help readers understand this work, we expanded the Appendix with illustrative descriptions.
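      The relation between the weightings and the capacity-achieving input distribution can be illustrated with the Blahut–Arimoto algorithm, a standard iterative method for computing channel capacity. It is used here only as a stand-in: the authors describe a Python optimization over the input-bin probabilities, and it is not implied that they used this algorithm. The sketch estimates the conditional distribution P(j | i) from a hypothetical table of counts, finds the input distribution that maximizes I(X;Y), and recovers w(i) as the ratio of the optimal to the observed input marginal, so that w(i)P_X(i) is the capacity-achieving distribution and w(i)P_XY(i, j) is the corresponding reweighted joint distribution.

```python
import numpy as np


def mutual_information(joint):
    """I(X;Y) in bits from a joint probability table P_XY(i, j)."""
    p = joint / joint.sum()
    outer = p.sum(axis=1, keepdims=True) @ p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / outer[nz])))


def blahut_arimoto(Q, tol=1e-10, max_iter=100_000):
    """Capacity (bits) and capacity-achieving input distribution for Q[i, j] = P(j | i)."""
    p = np.full(Q.shape[0], 1.0 / Q.shape[0])
    for _ in range(max_iter):
        q = p @ Q  # output marginal under the current input distribution
        with np.errstate(divide="ignore", invalid="ignore"):
            log_ratio = np.where(Q > 0, np.log(Q / q), 0.0)
        c = np.exp(np.sum(Q * log_ratio, axis=1))  # exp of KL(Q(.|i) || q), in nats
        lower, upper = np.log(p @ c), np.log(c.max())  # bounds bracketing the capacity
        if upper - lower < tol:
            break
        p = p * c / (p @ c)  # multiplicative update of the input distribution
    return lower / np.log(2), p


# Hypothetical counts: rows = input bins i, columns = output bins j.
counts = np.array([[270.0, 30.0], [10.0, 90.0]])
P_xy = counts / counts.sum()
P_x = P_xy.sum(axis=1)           # observed input marginal
Q = P_xy / P_x[:, None]          # channel P(j | i)
C, P_x_opt = blahut_arimoto(Q)
w = P_x_opt / P_x                # weightings w(i)
P_xy_opt = w[:, None] * P_xy     # reweighted joint, w(i) P_XY(i, j)
print(C, w, mutual_information(P_xy_opt))
```

      With the observed input marginal (0.75, 0.25) and this symmetric channel, the weights come out as w = (2/3, 2): the physical input distribution is not the capacity-achieving one, and the weights only convert between the two.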

      A more minor but important point: the inputs and outputs of the communication channel are never explicitly defined, which makes the meaning of the results unclear. When evaluating the capacity of an information channel, the inputs X and outputs Y should be carefully defined, so that the mutual information I(X;Y) is meaningful; the mutual information is then maximized to obtain capacity. Although it can be inferred that the input X is the ligand concentration, and the output Y is the expression of GFP, it would be helpful if this were stated explicitly.

      We agree with the Reviewer’s suggestion to describe the input and output more explicitly in the manuscript. We have therefore modified Figure 1A and B and the main text to describe the sources of input and output more clearly, as follows:

      Line 92: Accounting for the stochastic behavior of cellular signaling, information theory provides robust and quantitative tools to analyze complex communication channels. A fundamental metric of information theory is entropy, which quantifies the amount of disorder or uncertainty of a variable. In this respect, cellular signaling pathways with highly variable initiating input signals (e.g. stimulants) and correspondingly highly variable output responses (i.e. cellular signaling) can be characterized as having high entropy. Importantly, input and output can be mutually dependent, so knowing the input distribution can partly provide information about the output distribution. If noise is present in the communication channel, this mutual dependence between input and output is reduced. The mutual dependence between input and output is called mutual information. Mutual information is, therefore, a function of the input distribution, and the upper bound of mutual information over all input distributions is called channel capacity (SI section 1) (Cover and Thomas, 2012). In this report, a communication channel describes the signal transduction pathway of a C-type lectin receptor, which ultimately leads to NF-κB translocation and, finally, GFP expression in the reporter model (Fig. 1A). To quantify the signaling information of the communication channels, we used channel capacity. Importantly, the channel capacity does not merely describe the maximum intensity of the reporter cells. It takes into account the cellular variation and activation of single-cell-resolved data across the whole range of incoming stimulus and condenses all of these data into a single number.
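      For reference, the quantities named in the passage above have the standard definitions below (Cover and Thomas), written here for a discretized input X (for example, stimulus dose bin i) and output Y (for example, GFP reporter bin j); this identification of X and Y is an illustrative assumption:

```latex
H(X) = -\sum_{i} P_X(i)\,\log_2 P_X(i)

I(X;Y) = \sum_{i,j} P_{XY}(i,j)\,\log_2 \frac{P_{XY}(i,j)}{P_X(i)\,P_Y(j)} = H(Y) - H(Y \mid X)

C = \max_{P_X} I(X;Y)
```

      The conditional distribution of Y given X is fixed by the signaling pathway, so only the input distribution is varied in the maximization that defines C.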

    1. Author Response

      Reviewer #3 (Public Review):

      The authors examine the role of secreted BAFF in senescence phenotypes in THP1 AML cells and primary human fibroblasts. In the former, BAFF is found to potentiate the inflammatory phenotype (SASP) and in the latter to potentiate cell cycle arrest. This is an important study because the SASP is still largely considered in generic and monolithic terms, and it is necessary to deconvolute the SASP and examine its many components individually and in different contexts.

      Although the results show differences for BAFF in the two cell models, in many places key results are missing, results are over-interpreted, and/or controls are missing.

      1) Figure 1. Test whether the upregulation of BAFF is specific to senescence, or also in reversible quiescence arrest.

      We appreciate the Reviewer’s requests. We performed experiments in fibroblasts and THP-1 cells to assess BAFF levels in quiescence. As shown in the figure for Reviewers below, we induced quiescence in fibroblasts by serum starvation (0.1%) for 96 h and confirmed the quiescent state by measuring two markers of quiescence (reduced CCND1 mRNA and reduced phospho-S6 relative to cycling cells), following markers established previously (PMID 25483060) (panel A). In this case, the level of BAFF mRNA was increased upon quiescence (panel B).

      In THP-1 cells, we tried to induce quiescence by serum starvation and glutamine depletion for 96 h. Unfortunately, however, inducing quiescence in THP-1 cells was rather challenging, likely because they are cancer cells. Thus, we observed a reduction of cell proliferation in both conditions, but we observed a reduction in phospho-S6 only in the samples without glutamine (panel C). We failed to see increased BAFF mRNA levels in quiescent THP-1 cells after either serum starvation or glutamine depletion (panel D).

      In summary, further studies will be necessary to fully understand if the increased expression of BAFF seen in senescent cells is also observed in other conditions of growth suppression (such as quiescence or differentiation), as well as whether this effect is specific to different cell types.

      2) Figure 1, Supplement 1G. Show negative control IgG for immunofluorescence.

      We thank the Reviewer for this suggestion. Along with other changes during the revision, we decided to remove the immunofluorescence data in order to include more informative data.

      3) All results with siRNA should be validated with at least 2 individual siRNAs to eliminate the possibility of off-target effects.

      We agree with the Reviewer on the importance of testing individual siRNAs. For BAFF, we originally tested two independent siRNAs (BAFF#1 and BAFF#2) individually, but we also pooled them for additional analysis (referred to simply as “BAFFsi” throughout the manuscript). In the revised version of our manuscript, we included the key experiments performed with these two individual BAFF siRNAs. Upon BAFF silencing in THP-1 cells, we observed a reduction of SASP factors and SA-β-Gal activity levels with each individual siRNA (Figure 4-Figure Supplement 1D-F) and with the pooled siRNAs (Figure 4C). For WI-38 cells, we observed a reduction of p53 levels with individual and pooled siRNAs (Figure 7-Figure Supplement 1A), as well as a reduction in IL6 levels and SA-β-Gal activity (Figure 6-Figure Supplement 1D,E). After IRF1 silencing, we observed a reduction in BAFF pre-mRNA with two different pairs of CTRLsi and IRF1si pools (Figure 2I and supplementary Figure 2E). For the data on BAFF receptors, we used SMARTpools from Dharmacon, which are combinations of 4 siRNAs designed by the company to minimize off-target effects. These additions and clarifications are indicated in the revised manuscript.

      4) To confirm a role for IRF1 in the activation of BAFF, the authors should confirm the binding of IRF1 to the BAFF promoter by ChIP or ChIP-seq.

      We thank the Reviewer for this suggestion. We performed ChIP-qPCR analysis in THP-1 cells that were either proliferating or rendered senescent after exposure to IR (Figure 2H, Materials and methods section), and we confirmed the binding of IRF1 to the proximal promoter region of BAFF. As anticipated, this interaction was stronger after inducing senescence.

      5) Key antibodies should be validated by siRNA knockdown of their targets, for example, TACI, BCMA, and BAFF-R in Figure 5. Note that there is an apparent discrepancy between BCMA data in Figure 5B vs 5C.

      We fully agree with the Reviewer on this point and we thank him/her for helping us to improve this part of our manuscript. To address the discrepancy regarding BCMA western blot analysis and flow cytometry data, we silenced BCMA in THP-1 cells and tested two different antibodies advertised to recognize BCMA. This experiment allowed us to identify the correct band for BCMA by western blot analysis. We then confirmed that BCMA is upregulated in senescence, as observed by both western blot and flow cytometry analyses. We have modified the manuscript to reflect these changes. Please find these data in Figure 5A,B and Figure 5-Figure Supplement 1A of the revised manuscript.

      6) Figure 5E. Negative/specificity controls for this assay should be shown.

      We thank the reviewer for this comment and regret that we were unable to provide a negative control. The kit only provides a competitive wild-type oligomer used to test the specificity of the binding. For each sample (CTRLsi, BAFFsi, CTRLsi IR, BAFFsi IR) and each antibody tested (p65, p50, p52, RelB and c-Rel), we evaluated the reductions in signal upon addition of excess competitive oligomer per well (20 pmol/well) compared to wells with an inactive oligomer. However, the negative control was performed only as a single replicate, due to the limited quantity of nuclear extracts and the high number of samples and antibodies analyzed. We therefore considered this control as being ‘qualitative’ rather than fully ‘quantitative’.

      7) Hybridization arrays such as Figure 5H, Figure 6 - Supplement 1I, and Figure 6H should be shown as quantitated, normalized data with statistics from replicates.

      We appreciate this request. We have included the quantification and statistics for the phosphoarrays used for THP-1 and WI-38 cells, which had been performed in triplicate (Figure 7A, Figure 5-Figure Supplement 1D). The original arrays are shown in the respective Source Data Files. In the interest of space, we removed the cytokine array performed on IMR-90 cells and left instead the quantitative ELISA for IL6 (Figure 6-Figure Supplement 1F). The data obtained from the cytokine array analysis in Figure 4F and Figure 4-Supplemental Figure 1C are supported by quantitative multiplex ELISA measurements (Figure 4E and Figure 4C).

      8) Figure 6B - Supplement 1. Controls to confirm fractionation (i.e., non-contamination by cytosolic and nuclear proteins) should be shown.

      We thank the Reviewer for this suggestion. We tested the efficiency of fractionation and we did in fact observe some degree of contamination from cytosolic proteins using the earlier version of the kit (Pierce, cat. 89881). We therefore purchased an improved version of the kit (Pierce, cat. A44390) and repeated the surface fractionation assay, which this time showed improved fractionation (Figure 7-Figure Supplement 1B). Interestingly, with the improved fractionation strategy, we observed that BAFF receptors in fibroblasts were almost exclusively localized inside the cell rather than on the cell surface, in contrast to THP-1 cells, where we had detected them on the surface. Further validation of BAFF receptor antibodies has been provided in Figure 5-Figure Supplement 1A. As described in the text, the intracellular localization of BAFF receptors was previously reported in other cell types and conditions (PMID 31137630, PMID 19258594, PMID 30333819, PMID 10903733), and thus it is possible that BAFF may act through non-canonical mechanisms in WI-38 cells. Nonetheless, we did detect a small amount of BAFFR on the cell surface, and furthermore, BAFFR silencing reduced the level of p53 in fibroblasts. Therefore, we propose that BAFFR may be the primary receptor involved in p53 regulation in fibroblasts (Figure 7-Figure Supplement 1B,C). Our data on BAFF receptors deserve deeper characterization in a future study of the functions of BAFF receptors in senescence.

      9) Figure 6A. Knockdown of BAFF should be shown by western blot.

      Yes, definitely. We appreciate this comment and have included BAFF knockdown data in fibroblasts by western blot analysis (Figure 7B).

      10) Figure 6G. Although BAFF knockdown decreases the expression of p53, p21 increases. How do the authors explain this?

      We thank the Reviewer for the interesting question. We too were surprised to observe that the p53-dependent transcripts regulated by BAFF did not include CDKN1A (p21) mRNA, as confirmed by western blot analysis. The accumulation of p21 in senescence can also be regulated by p53-independent pathways and in p53-/- cells, for example by p90RSK, SP1, and ZNF84 (PMID 24136223, PMID 25051367, PMID 33925586). Ultimately, we removed the data on p21 and γ-H2AX in favor of other data and to streamline the content of this manuscript for the reader.

    1. Author Response

      Reviewer #1 (Public Review):

      1-1. I do have some concerns that the differences in network clustering reported in Fig 6 may be due to noise and I think the comparisons against the HCP parcellation could be more robust. Specifically, with regard to the network clustering in Fig 6, the authors use a clustering algorithm (which is not explained) to cluster the parcels into different functional networks. They achieve this by estimating the mean time series for each parcel in each individual, which they then correlate between the n regions, to generate an nxn connectivity matrix. This they then binarise, before averaging across individuals within an age group. It strikes me that binarising before averaging will artificially reduce connections for which only a subset of individuals are set to zero. Therefore averaging should really occur before binarising. Then I think the stability of these clusters should be explored by creating random repeat and generation groups (as done for the original parcels) or just by bootstrapping the process. I would be interested to see whether after all this the observation that the posterior frontoparietal expands to include the parahippocampal gyrus from 3-6 months and then disappears at 9 months - remains.

      We thank the reviewer for this insightful comment on our clustering process. For the “binarizing before averaging” step, we followed the method proposed by Yeo et al. (1). In this method, all correlation matrices are binarized using individual-specific thresholds. Specifically, each individual-specific threshold is set at a percentile such that only the strongest 10% of connections are kept and set to 1, while all other connections are set to 0. Yeo et al. (1) explained their motivation as follows: “the binarization of the correlation matrix leads to significantly better clustering results, although the algorithm appears robust to the particular choice of the threshold”. A likely reason is that binarizing each individual's connectivity provides a degree of normalization, so that every subject contributes the same number of connections. If averaging occurs before binarizing, different subjects contribute different amounts of connectivity, which introduces bias. We also tested the stability of the ‘binarizing first’ and ‘averaging first’ strategies; the results are shown in Fig. R1 below. Consistent with (1), binarizing before averaging leads to better clustering stability. We added the motivation for binarizing before averaging in the revised manuscript between line 577 and line 581.

      Fig. R1. The comparison of clustering stability of different methods. The red line refers to the clustering stability when binarizing the correlation matrices first and then averaging the matrices across individuals, while the blue line refers to the clustering stability when averaging the correlation matrices across individuals first and then binarizing the average matrix.
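      The two orderings discussed above can be contrasted in a short NumPy sketch. It assumes a single individual-specific threshold per subject, set at the 90th percentile of that subject's off-diagonal correlations so that the strongest 10% of connections are kept; the function names and toy data are hypothetical, and the clustering step itself is omitted.

```python
import numpy as np

def binarize_individual(corr, keep=0.10):
    """Binarize one subject's connectivity matrix with an individual-specific threshold.

    The threshold is the (1 - keep) percentile of the subject's off-diagonal
    correlations, so each subject contributes (about) the same number of edges.
    """
    n = corr.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    threshold = np.percentile(corr[off_diag], 100 * (1 - keep))
    return ((corr >= threshold) & off_diag).astype(float)

def binarize_then_average(corrs, keep=0.10):
    # Binarize each subject first; entries become the fraction of subjects sharing that edge.
    return np.mean([binarize_individual(c, keep) for c in corrs], axis=0)

def average_then_binarize(corrs, keep=0.10):
    # Average raw correlations first, then apply one threshold to the group matrix.
    return binarize_individual(np.mean(corrs, axis=0), keep)

def random_corr(n_regions, n_timepoints, rng):
    # Toy parcel-level correlation matrix from random time series (symmetrized exactly).
    ts = rng.standard_normal((n_timepoints, n_regions))
    c = np.corrcoef(ts, rowvar=False)
    return (c + c.T) / 2

rng = np.random.default_rng(0)
corrs = [random_corr(20, 50, rng) for _ in range(5)]
group_binarized_first = binarize_then_average(corrs)
group_averaged_first = average_then_binarize(corrs)
```

With binarization first, a subject with globally weaker correlations still contributes its own top 10% of edges, which is the normalization argument made above.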

      For the final clustering results, we repeated our clustering method on 100 bootstrap samples and assigned each parcel its majority-vote label. The comparison between the original and bootstrapped results is shown in Fig. R2. Overall, the two results agree well. However, some parcels differ between them, especially parcels located near network boundaries or the medial wall. The observation that “the posterior frontoparietal expands to include the parahippocampal gyrus from 3-6 months and then disappears at 9 months” was not reproduced in the bootstrapped results. These results suggest that the clustering method is fairly robust and that the discovered patterns are relatively stable, while the differences between the original and bootstrapped results may be caused by noise or inter-subject variability.

      Fig. R2. Top panel: the network clustering results using all data in the original manuscript. Bottom panel: the network clustering results obtained by majority voting over 100 bootstrap iterations. Black circles and red arrows indicate the parahippocampal gyrus, which was included in the posterior frontoparietal network in the original results but is not reproduced in the bootstrapped results. (M: months)
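      A majority vote over bootstrap runs needs one step that is easy to overlook: cluster labels are arbitrary, so each run's network labels must first be matched to a reference labelling. The sketch below assumes Hungarian matching (scipy.optimize.linear_sum_assignment) on parcel overlap for that step. We have not confirmed that the labels were matched this way here, and all function names are hypothetical. The per-parcel agreement fraction it returns is one way to flag unstable parcels near network boundaries or the medial wall.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_labels(labels, reference, n_labels):
    # overlap[i, j] = number of parcels labelled i in this run and j in the reference
    overlap = np.zeros((n_labels, n_labels))
    np.add.at(overlap, (labels, reference), 1)
    rows, cols = linear_sum_assignment(-overlap)  # maximize total overlap
    mapping = np.empty(n_labels, dtype=int)
    mapping[rows] = cols
    return mapping[labels]

def consensus_labels(label_runs, reference):
    """Majority-vote network label per parcel across bootstrap runs.

    Returns the winning label and the fraction of runs that agree with it.
    """
    runs = np.asarray(label_runs)
    reference = np.asarray(reference)
    n_labels = int(max(runs.max(), reference.max())) + 1
    aligned = np.array([align_labels(r, reference, n_labels) for r in runs])
    counts = np.stack([(aligned == k).sum(axis=0) for k in range(n_labels)])
    winner = counts.argmax(axis=0)
    agreement = counts.max(axis=0) / len(runs)
    return winner, agreement

# Toy example: 6 parcels, 3 networks; run 2 is a relabelled copy, run 3 differs at the last parcel.
reference = np.array([0, 0, 1, 1, 2, 2])
runs = [[0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1], [0, 0, 1, 1, 2, 1]]
winner, agreement = consensus_labels(runs, reference)
```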

      1-2. Then with regard to the comparison against the HCP parcellation, this is only qualitative. The authors should see whether the comparison is quantitatively better relative to the null clusterings that they produce.

      Thank you for this great suggestion! As suggested, we added a quantitative comparison using the Hausdorff distance. As in the comparisons of parcel variance and homogeneity, 1,000 null parcellations were created by randomly rotating our parcellation by small angles on the spherical surface. We compared our parcellation with the null parcellations by evaluating their Hausdorff distances, on the sphere, to specific areas of the HCP parcellation, namely Brodmann areas 2, 3b, 4+3a, and 44+45, as well as V1 and MT+MST. The results are shown in Figure 4. Our parcellation generally shows significantly lower Hausdorff distances to these HCP areas than the null parcellations do, suggesting that our parcel borders are closer to the HCP borders.

      However, we noticed that a very small number of null parcellations show smaller Hausdorff distances than our parcellation. A possible reason is that our surface registration to the HCP template was based purely on cortical folding, without functional gradient density maps, which are not available for the HCP template. As a result, high-quality functional alignment between our infant data and the HCP space is not guaranteed, which inevitably increases the Hausdorff distance between our parcellation and the HCP parcellation.
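      The rotation-based null comparison can be sketched as follows. None of these assumptions come from the manuscript: boundaries are given as vertex coordinates on the unit sphere; distances are Euclidean (chordal), which is monotone in geodesic angle, so a geodesic Hausdorff distance can be recovered as 2*arcsin(d/2); rotations use a random axis and an angle of at most 10 degrees; and the empirical p-value counts null parcellations at least as close to the target as ours. The helper names are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff
from scipy.spatial.transform import Rotation

def hausdorff(a, b):
    # Symmetric Hausdorff distance between two point sets (rows are xyz coordinates).
    return max(directed_hausdorff(a, b)[0], directed_hausdorff(b, a)[0])

def rotation_null_test(ours, target, n_null=1000, max_angle=np.deg2rad(10), seed=0):
    """Compare our boundary's distance to a target boundary against small random rotations of it."""
    rng = np.random.default_rng(seed)
    observed = hausdorff(ours, target)
    null = np.empty(n_null)
    for i in range(n_null):
        axis = rng.standard_normal(3)
        axis /= np.linalg.norm(axis)
        rot = Rotation.from_rotvec(axis * rng.uniform(0, max_angle))
        null[i] = hausdorff(rot.apply(ours), target)
    # Fraction of null parcellations at least as close as ours (with the usual +1 correction).
    p_value = (1 + np.sum(null <= observed)) / (1 + n_null)
    return observed, null, p_value

# Toy boundary: a ring of 100 points at 60 degrees from the north pole of the unit sphere.
theta = np.linspace(0, 2 * np.pi, 100, endpoint=False)
ring = np.column_stack([np.sin(np.pi / 3) * np.cos(theta),
                        np.sin(np.pi / 3) * np.sin(theta),
                        np.full_like(theta, np.cos(np.pi / 3))])
observed, null, p_value = rotation_null_test(ring, ring, n_null=200)
```

In this toy case the "parcellation" matches the target exactly, so every rotated copy is farther away and the p-value reaches its minimum of 1/(n_null + 1).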

      1-3. … not all individuals appear (from Fig 8) to be acquired exactly at the desired timepoints, so maybe the authors might comment on why they decided not to apply any kernel weighted or smoothing to their averaging? Pg. 8 'and parcel numbers show slight changes that follow a multi-peak fluctuation, with inflection ages of 9 and 18 months' explain - the parcels per age group vary - with age with peaks at 9 and 18 - could this be due to differences in the subject numbers, or the subjects that were scanned at that point?

      We agree with the reviewer that not all subjects were scanned at exactly the nominal time points. This was by design in the data acquisition protocol, so that the scans seamlessly cover the early postnatal stage and provide a quasi-continuous observation of dynamic early brain development.

      We did not apply a kernel-weighted average or smoothing when generating the parcellation because we wanted each scan to contribute equally, so that each parcellation map represents the whole cohort in the covered age range rather than only part of it. In addition, our final ‘age-common parcellation’ is intended to represent all subjects from birth to 2 years of age. However, we agree that for a parcellation map designed for a specific age, e.g., 1-year-olds, a kernel-weighted average or an even more restricted age range could be a more appropriate solution.
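      To make this design choice concrete, the sketch below contrasts the equal-weight group average used here with a Gaussian age kernel of the kind the reviewer suggests. The kernel width of 1.5 months, the function name, and the toy maps are illustrative assumptions, not values from the manuscript.

```python
import numpy as np

def group_map(maps, ages_months, target_age=None, sigma=1.5):
    """Weighted group average of per-scan maps (n_scans x n_vertices).

    target_age=None: equal weights, so every scan contributes equally.
    Otherwise: Gaussian weights in age centred on target_age (sigma in months).
    """
    maps = np.asarray(maps, dtype=float)
    ages = np.asarray(ages_months, dtype=float)
    if target_age is None:
        weights = np.ones_like(ages)
    else:
        weights = np.exp(-0.5 * ((ages - target_age) / sigma) ** 2)
    weights /= weights.sum()
    return weights @ maps

# Three toy scans acquired at 11, 12, and 20 months.
maps = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
ages = [11, 12, 20]
equal_weight = group_map(maps, ages)                   # mean over all scans
age_12_kernel = group_map(maps, ages, target_age=12)   # dominated by the 11- and 12-month scans
```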

      Regarding whether the parcel number fluctuates with subject numbers, we added an experiment in which, for each age group, we randomly resampled 100 scans (based on the minimum scan number across age groups) using bootstrapping, and repeated this process 100 times. The average parcel number for each age is reported in Table R1. We did not observe strong changes in parcel numbers when reducing scan numbers, which suggests that our parcel numbers are not strongly related to subject numbers. However, because the parcel number does not increase greatly from 18M to 24M in the bootstrapping results, we modified the statement about the parcel number in the manuscript to ‘… all parcel numbers fall between 461 to 493 per hemisphere, where the parcel number attains a maximum at around 9 months and then reduces slightly and remains relatively stable afterward. …’, which can be found between line 121 and line 122.
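      The subsampling control can be sketched as a balanced bootstrap: the same number of scans is drawn, with replacement, from every age group, a statistic is computed on each draw, and the results are averaged. Here `statistic` is a hypothetical stand-in for the actual parcellation-and-count pipeline, which is not reproduced; the function name and with-replacement sampling are assumptions.

```python
import numpy as np

def balanced_bootstrap(scans_by_age, statistic, n_per_group=100, n_rep=100, seed=0):
    """Mean of `statistic` over n_rep draws of n_per_group scans (with replacement) per age group."""
    rng = np.random.default_rng(seed)
    result = {}
    for age, scan_ids in scans_by_age.items():
        values = [statistic(rng.choice(scan_ids, size=n_per_group, replace=True))
                  for _ in range(n_rep)]
        result[age] = float(np.mean(values))
    return result

# Toy usage: the "statistic" is the number of distinct scans drawn, a placeholder
# for building a parcellation from the drawn scans and counting its parcels.
scans_by_age = {"9M": np.arange(150), "18M": np.arange(120)}
summary = balanced_bootstrap(scans_by_age, lambda ids: np.unique(ids).size)
```

Equalizing the sample size across age groups is what separates a genuine age effect on parcel number from an effect of unequal group sizes.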

      1-4. I also have some residual concerns over the number of parcels reported, specifically as to whether all of this represents fine-grained functional organisation, or whether some of it represents noise. The number of parcels reported is very high. While Glasser et al 2016 reports 360 as a lower bound, it seems unlikely that the number of parcels estimated by that method would greatly exceed 400. This would align with the previous work of Van Essen et al (which the authors cite as 53) which suggests a high bound of 400 regions. While accepting Eickhoff's argument that a more modular view of parcellation might be appropriate, these are infants with underdeveloped brain function.

      We thank the reviewer for this insightful comment. We agree that some parcels may reflect noise, since noise is introduced at every step, including data acquisition, image processing, surface reconstruction, and registration, and functional MRI is noisier than structural MRI. Although our experiments show that our parcellation is fine-grained and suitable for studying infant brain functional development, this is hard to validate directly and quantitatively because no ground truth is available.

      Nevertheless, we are motivated to create fine-grained parcellations. With increasingly large, higher-resolution imaging datasets and advanced computational methods, parcellations with finer regions are desirable for downstream analyses, especially given the hierarchical nature of brain organization (2). The main reason our method generates much finer parcellation maps is that both our registration and parcellation processes are based on the functional gradient density, a fine-grained feature map derived from fMRI. This leads to both better inter-subject alignment of functional boundaries and finer region partitions. This strategy differs from that of Glasser et al. (3), which jointly considers multimodal information to define parcel boundaries, so parcels revealed purely by functional MRI might be missed in the HCP parcellation. We hope our parcellation framework can serve as a useful reference for this research direction. We added this discussion in the revised manuscript between line 268 and line 271.

      Regarding the parcel number, even without surface registration based on fine-grained functional features, recent adult fMRI-based parcellations have used substantially more parcels, such as up to 1,000 parcels in Schaefer et al. (4), 518 parcels in Peng et al. (5), and 1,600 parcels in Zhao et al. (6). For infants, we agree that functional connectivity might not be as strong as in adults. However, several studies (7-9) suggest that the basic units of functional organization are likely present in infant brains, and that brain functional development gradually shapes the brain networks. Therefore, the functional parcel units in infants may be on a scale comparable to that in adults. Even so, we agree that more research on larger datasets is needed for better evaluation. We added this discussion in the revised manuscript between line 275 and line 280.

      1-5. Further comparisons across different subjects based on small parcels increase the chances of downstream analyses incorporating image registration noise, since as Glasser et al 2016 noted, there are many examples of topographic variation, which diffeomorphic registration cannot match. Therefore averaging across individuals would likely lose this granularity. I'm not sure how to test this beyond showing that the networks work well for downstream analyses but I think these issues should be discussed.

      We agree with the reviewer that averaging across individuals inevitably brings some registration errors to the parcellation, especially for regions with high topographic variation across subjects, which would lead to loss of granularity in these regions. We believe this is an important issue that exists in most methods on group-level parcellations, and the eventual solution might be individualized parcellation, which will be our future work. We added this discussion in the revised manuscript between line 288 and line 292.

      We also agree with the reviewer that downstream analyses are important evaluations of parcellations. We provided a beta version of our parcellation with 602 parcels to our colleagues, who used it for infant individual identification across ages based on functional connectivity, to explore infant functional connectome fingerprinting (10). We compared the performance of parcellations with 602 ROIs (our beta version), 360 ROIs (HCP MMP parcellation (3)), and 68 ROIs (FreeSurfer parcellation (11)). The results (Fig. R3) show that our parcellation, which has the most parcels, yields better accuracy than the other parcellations. We added a description of this downstream application in the Discussion between line 284 and line 287.

      Fig. R3. Comparison of different parcellations for infant individual identification across ages based on functional connectivity (figure source: Hu et al. (10)). The parcellation with 602 ROIs is the beta version of our parcellation; 360 ROIs denotes the HCP MMP parcellation (3) and 68 ROIs the FreeSurfer parcellation (11). In this downstream task, the parcellation with more parcels yielded better accuracy.
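      For readers unfamiliar with connectome fingerprinting, the sketch below shows a commonly used identification procedure: each subject's later-session connectome is matched to the earlier-session connectome it correlates with most strongly. We have not checked that Hu et al. (10) used exactly this procedure; the function names and the synthetic data are hypothetical.

```python
import numpy as np

def upper_triangle(fc):
    # Vectorize the unique edges of a symmetric ROI x ROI connectivity matrix.
    i, j = np.triu_indices(fc.shape[0], k=1)
    return fc[i, j]

def identification_accuracy(fc_session1, fc_session2):
    """Fraction of subjects whose session-2 connectome best matches their own session-1 connectome.

    Both arguments are lists of ROI x ROI matrices in the same subject order.
    """
    a = np.array([upper_triangle(f) for f in fc_session1])
    b = np.array([upper_triangle(f) for f in fc_session2])
    n = len(a)
    # sim[i, j] = Pearson r between subject i at session 2 and subject j at session 1
    sim = np.corrcoef(b, a)[:n, n:]
    predicted = sim.argmax(axis=1)
    return float(np.mean(predicted == np.arange(n)))

# Synthetic data: each subject has a stable connectome plus session-specific noise.
rng = np.random.default_rng(1)
n_sub, n_roi = 10, 30
bases = [np.corrcoef(rng.standard_normal((50, n_roi)), rowvar=False) for _ in range(n_sub)]
session1 = [f + rng.normal(scale=0.05, size=f.shape) for f in bases]
session2 = [f + rng.normal(scale=0.05, size=f.shape) for f in bases]
accuracy = identification_accuracy(session1, session2)
shuffled_accuracy = identification_accuracy(session1, session2[::-1])
```

More ROIs give a longer edge vector per subject, which is one intuition for why finer parcellations can separate individuals better in this kind of task.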

      1-6. Finally, I feel the methods lack clarity in some areas and that many key references are missing. In general I don't think that key methods should be described only through references to other papers. And there are many references, particularly to FSL papers, that are missing.

      We thank the reviewer for this great suggestion. We added references for FLIRT, FSL, MCFLIRT, and TOPUP. For the alignment to the HCP 32k_LR space, we first aligned all subjects to the fsaverage space using spherical demons, then used part of the HCP pipeline (12) to map the surfaces from the fsaverage space to the HCP 164k_LR space, and downsampled them to the 32k_LR space. We modified this citation to reference the HCP pipeline by Glasser et al. (12) instead, and detailed this registration process in the revised manuscript between line 434 and line 440, as follows:

      “… The population-mean surface maps were mapped to the HCP 164k ‘fs_LR’ space using the deformation field that deforms the ‘fsaverage’ space to the ‘fs_LR’ space released by Van Essen et al. (13), which was obtained by landmark-based registration. By concatenating the three deformation fields of steps 1, 3, and 4, we directly warped all cortical surfaces from individual scan spaces to the HCP 164k_LR space and then resampled them to 32k_LR using the HCP pipeline (12), thus establishing vertex-to-vertex correspondences across individuals and ages …”

      Reviewer #2 (Public Review):

      2-1. Diminishing enthusiasm is the lack of focus in the result section, the frequent use of jargon, and figures that are often difficult to interpret. If those issues are addressed, the proposed atlas could have a high impact in the field especially as it is aligned with the template of the Human Connectome Project.

      We thank Reviewer #2 for appreciating our atlas. Following the reviewer’s suggestion, we went through the manuscript again, focusing on reducing jargon and improving the clarity of the Results section, as well as the figures and figure captions. We hope these corrections make our work accessible to a broader community. Our revisions are detailed below. Meanwhile, our parcellation maps have been aligned with the HCP and FreeSurfer templates and made available via NITRC at: https://www.nitrc.org/projects/infantsurfatlas/.

      References

      1. B. T. Yeo, F. M. Krienen, J. Sepulcre, M. R. Sabuncu, D. Lashkari, M. Hollinshead, J. L. Roffman, J. W. Smoller, L. Zöllei, J. R. Polimeni, The organization of the human cerebral cortex estimated by intrinsic functional connectivity. Journal of Neurophysiology 106, 1125-1165 (2011).

      2. S. B. Eickhoff, R. T. Constable, B. T. Yeo, Topographic organization of the cerebral cortex and brain cartography. NeuroImage 170, 332-347 (2018).

      3. M. F. Glasser, T. S. Coalson, E. C. Robinson, C. D. Hacker, J. Harwell, E. Yacoub, K. Ugurbil, J. Andersson, C. F. Beckmann, M. Jenkinson, S. M. Smith, D. C. Van Essen, A multi-modal parcellation of human cerebral cortex. Nature 536, 171-178 (2016).

      4. A. Schaefer, R. Kong, E. M. Gordon, T. O. Laumann, X.-N. Zuo, A. J. Holmes, S. B. Eickhoff, B. T. Yeo, Local-global parcellation of the human cerebral cortex from intrinsic functional connectivity MRI. Cerebral Cortex 28, 3095-3114 (2018).

      5. L. Peng, Z. Luo, L.-L. Zeng, C. Hou, H. Shen, Z. Zhou, D. Hu, Parcellating the human brain using resting-state dynamic functional connectivity. Cerebral Cortex, (2022).

      6. J. Zhao, C. Tang, J. Nie, Functional parcellation of individual cerebral cortex based on functional MRI. Neuroinformatics 18, 295-306 (2020).

      7. W. Gao, S. Alcauter, J. K. Smith, J. H. Gilmore, W. Lin, Development of human brain cortical network architecture during infancy. Brain Structure and Function 220, 1173-1186 (2015).

      8. W. Gao, H. Zhu, K. S. Giovanello, J. K. Smith, D. Shen, J. H. Gilmore, W. Lin, Evidence on the emergence of the brain's default network from 2-week-old to 2-year-old healthy pediatric subjects. Proceedings of the National Academy of Sciences 106, 6790-6795 (2009).

      9. K. Keunen, S. J. Counsell, M. J. Benders, The emergence of functional architecture during early brain development. NeuroImage 160, 2-14 (2017).

      10. D. Hu, F. Wang, H. Zhang, Z. Wu, Z. Zhou, G. Li, L. Wang, W. Lin, G. Li, UNC/UMN Baby Connectome Project Consortium, Existence of Functional Connectome Fingerprint during Infancy and Its Stability over Months. Journal of Neuroscience 42, 377-389 (2022).

      11. R. S. Desikan, F. Ségonne, B. Fischl, B. T. Quinn, B. C. Dickerson, D. Blacker, R. L. Buckner, A. M. Dale, R. P. Maguire, B. T. Hyman, An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. Neuroimage 31, 968-980 (2006).

      12. M. F. Glasser, S. N. Sotiropoulos, J. A. Wilson, T. S. Coalson, B. Fischl, J. L. Andersson, J. Xu, S. Jbabdi, M. Webster, J. R. Polimeni, The minimal preprocessing pipelines for the Human Connectome Project. NeuroImage 80, 105-124 (2013).

    1. Author Response

      Reviewer #1 (Public Review):

      The authors present a system that allows the measurement of OCR on diverse tissues. Using two optodes, one before the tissue under examination, and one after, allows the OCR to be measured as the difference between the concentration of O2 in the in-flow gas and the concentration of O2 in the out-flow gas. The system maintains the tissue at a set concentration of dissolved O2 so that experiments can be performed over a long period of time. The authors have provided ample data and full methods and their conclusions are most likely reliable.
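      The measurement principle summarized by the reviewer reduces to a mass balance. This sketch assumes steady state, that the inflow sensor reads the O2 concentration actually reaching the tissue, and the units shown (flow in µL/min, O2 in µM, with 1 µM = 1 pmol/µL). Under those assumptions, OCR follows directly from the flow rate and the inflow-outflow difference. The function name and the example numbers are hypothetical.

```python
def oxygen_consumption_rate(flow_ul_per_min, o2_in_um, o2_out_um, tissue_amount=1.0):
    """Steady-state OCR from an inflow/outflow O2 difference.

    flow (uL/min) x concentration difference (uM = pmol/uL) gives pmol O2/min;
    dividing by tissue_amount (e.g. number of islets) normalizes per unit tissue.
    """
    return flow_ul_per_min * (o2_in_um - o2_out_um) / tissue_amount

# Hypothetical example: 30 uL/min, 180 uM in, 170 uM out, 100 islets -> pmol/min per islet.
ocr_per_islet = oxygen_consumption_rate(30.0, 180.0, 170.0, tissue_amount=100)
```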

      Currently, we know that O2 is critical for diverse physiological processes; however, it is rarely controlled as carefully as non-gas solutes such as glucose, because we lack methods to control its delivery and infer its consumption. By addressing this need, the authors contribute something valuable to the field, which will hopefully be built on by others. The authors have already begun to show the utility of their system by exploring the complicated biology of H2S. As delivering this gas in a controlled manner is hard, often people use NaHS instead. In line with previous studies (well cited by the authors), differences are observed.

      Specific points

      1) The gas control system is used with islets, INS-1 832/12 cells, retinas, and liver tissue, demonstrating its broad applicability.

      2) The system as a platform can have diverse extra measurement modalities attached to it, for example visible-wavelength absorbance and fluorescence. Metabolite concentrations in the tissue culture outflow could also be measured.

      3) The reduction state of cyt c and cyt c oxidase are measured from the second derivative of absorbance at 550 and 605 nm. Ideally, to reliably decompose these signals full spectra around 550-605 nm would be collected. As the authors are only using cytochrome reduction state as a qualitative measure and appear careful to avoid over-interpretation this method should be fine. However, the authors ought to show a representative time course including the fully oxidised and reduced states demonstrating this approach as making these measurements is demanding and will depend on the exact spectroscopic set-up. Without this information it is hard to judge the reliability of the paper.

      We appreciate the reviewer giving us latitude for a less robust measurement. However, we did in fact do what the reviewer suggests. With the Ocean Optics spectrophotometer, we measure the full light spectrum from 400 to 650 nm, and from these spectral data we calculate the first and second derivatives of the absorbance. We have previously published our approach to spectral analysis, including the use of the fully oxidized and reduced states (Sweet IR, G Khalil, AR Wallen, M Steedman, KA Schenkman, JA Reems, SE Kahn, JB Callis. Continuous measurement of oxygen consumption by pancreatic islets. Diabetes Tech. Ther. 4: 661-672, 2002; Sweet IR, Cook DL, DeJulio E, Wallen AR, Khalil G, Callis JB, Reems JA: Regulation of ATP/ADP in pancreatic islets. Diabetes 53:401-409. 2004), so we did not include all the details. To make our description clear, we have added a more thorough explanation that we used spectral analysis rather than data obtained at single wavelengths.
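      The derivative-based approach described in this reply can be sketched with a Savitzky-Golay filter, which smooths and differentiates a spectrum in one step. The filter choice, window settings, synthetic spectra, and linear scaling between fully oxidized and fully reduced calibration states are our assumptions for illustration; the published procedure (Sweet et al., 2002) may differ in detail.

```python
import numpy as np
from scipy.signal import savgol_filter

def second_derivative(absorbance, wavelengths_nm, window=11, polyorder=3):
    # Smoothed second derivative d2A/dlambda2; assumes uniformly spaced wavelengths.
    step = wavelengths_nm[1] - wavelengths_nm[0]
    return savgol_filter(absorbance, window, polyorder, deriv=2, delta=step)

def fraction_reduced(d2_sample, d2_oxidized, d2_reduced):
    # Linear scaling: 0 at the fully oxidized state, 1 at the fully reduced state.
    return (d2_sample - d2_oxidized) / (d2_reduced - d2_oxidized)

# Synthetic spectra (illustrative only): sloping baseline plus a reduced-cytochrome band at 550 nm.
wl = np.arange(400.0, 651.0, 1.0)
baseline = 0.002 * (wl - 400.0)
band = np.exp(-0.5 * ((wl - 550.0) / 5.0) ** 2)
spectra = {"oxidized": baseline,
           "reduced": baseline + 0.05 * band,
           "sample": baseline + 0.02 * band}

i550 = int(np.searchsorted(wl, 550.0))
d2 = {name: second_derivative(a, wl)[i550] for name, a in spectra.items()}
frac = fraction_reduced(d2["sample"], d2["oxidized"], d2["reduced"])
```

A linear baseline has zero second derivative, which is why this kind of analysis suppresses sloping background absorbance from scattering tissue.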

      Reviewer #2 (Public Review):

      The present project is an extension of prior work from this work group in which they describe a technological advancement to their published flow-culture system. Such improvements now incorporate technology that allows for metabolic characterization of mammalian tissues while precisely controlling the concentration of abundant gases (e.g., O2), as well as trace gases (e.g., H2S). The present article demonstrates the utility of this system in the context of hypoxia/re-oxygenation experiments, as well as exposure to H2S. Although the methodology described herein is clearly capable of detecting nuanced metabolic changes in response to variations in O2 or H2S, the lack of a head-to-head comparison with other techniques makes it difficult to discern the potential impact of the technology.

      We understand the benefit of comparing a new method with currently used methods. However, the novelty of our methodology is that it can control tissue exposure to both abundant and trace dissolved gases, functions that existing instruments such as the Seahorse and the Oxygraph do not provide. In addition, the continuous flow of media allows maintenance and assessment of tissue models that cannot be accommodated by static or spinner systems. Because we are the first to report this technology, a direct comparison to benchmarks is not possible. In the past, however, we have tested liver slices and retina in a Seahorse, and the tissue died within 120 minutes, presumably due to the lack of flow/reoxygenation in the tissue. In addition, islets placed in spinner systems such as the Oxygraph become fragmented and broken very rapidly. Therefore, a head-to-head comparison of the tissue OCR response to changes in gas composition cannot be meaningfully carried out for the facets of our method that we highlighted. The methodology we present has capabilities that do not exist in any other commercially available system; we have stated this in the last line of the second paragraph of the Introduction. Regarding the general reliability of the O2 consumption measurement, the unprecedented accuracy and stability of the O2 detectors and the ability of our flow system to maintain tissue for days while generating accurate and reproducible measurements of O2 consumption have previously been established (Sweet IR, Gilbert M, Sabek O, Fraga DW, Gaber AO, Reems JA. Glucose Stimulation of Cytochrome c Reduction and Oxygen Consumption as Assessment of Human Islet Quality. Transplantation 80: 1003- 1011, 2005; Neal AS, Rountree AM, Philips CW, Kavanagh TJ, Williams DP, Newham P, Khalil G, Cook DL, Sweet IR. Quantification of low-level drug effects using real-time, in vitro measurement of oxygen consumption rate. Toxicological Sciences 148: 594-602, 2015).

      In addition, diffusion gradients both in the bath, as well as the tissue itself likely impact the accuracy of the metabolic measurements. This is likely relevant for the liver slices experiments.

      We agree that there are certainly concentration gradients within tissue, and that these are increased in the absence of capillary flow. Nonetheless, the gradients will certainly be smaller than those in static systems. In general, the optimal size of tissue pieces is a trade-off between the potential for hypoxia if the tissue is too large and a lack of untraumatized tissue if it is too small. We have added text noting that these effects should be considered when choosing the size and shape of liver slices or other tissue models placed into the flow system.
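      The size trade-off can be made quantitative with a textbook estimate. For a slab with zero-order O2 consumption q, diffusion coefficient D, and surface concentration C0, solving D·C'' = q with C = 0 and C' = 0 at the anoxic front gives a penetration depth x = sqrt(2·D·C0/q). A slab supplied from both faces stays oxygenated up to a thickness of about 2x. The parameter values below are illustrative round numbers, not measurements from liver or retina, and real tissue is not strictly zero-order at low O2.

```python
import math

def o2_penetration_depth(diffusivity_m2_s, surface_conc_mol_m3, consumption_mol_m3_s):
    """Depth (m) at which O2 reaches zero in a slab with zero-order consumption.

    Solves D * C'' = q with C = C' = 0 at the anoxic front and C = C0 at the surface,
    giving x = sqrt(2 * D * C0 / q).
    """
    return math.sqrt(2 * diffusivity_m2_s * surface_conc_mol_m3 / consumption_mol_m3_s)

# Illustrative values only: D = 2e-9 m^2/s, C0 = 0.2 mol/m^3 (200 uM), q = 0.02 mol/(m^3 s).
depth = o2_penetration_depth(2e-9, 0.2, 0.02)   # one exposed face
max_thickness = 2 * depth                        # slab supplied from both faces
```

Because the depth scales with the square root of C0/q, doubling the perfusate O2 only increases the usable thickness by about 41%, which is why tissue geometry matters as much as the gas setpoint.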

      Following resection, liver tissue can be mechanically permeabilized (PMID: 12054447). In the present experiments, no controls were put in place to discern if the tissue was permeabilized. This could be checked by adding in adenylates and additional carbon substrates and assessing the impact on OCR. Similar controls likely need to be implemented for the islet and retina experiments.

      We have previously used flow systems to maintain islets and liver for 24 hours or more (Neal AS, Rountree AM, Kernan K, Van Yserloo B, Zhang H, Reed BJ, Osborne W, Wang W, Sweet IR. Real time imaging of intracellular hydrogen peroxide in pancreatic islets. Biochem. J. 473:4443-4456, 2016; Neal AS, Rountree AM, Philips CW, Kavanagh TJ, Williams DP, Newham P, Khalil G, Cook DL, Sweet IR. Quantification of low-level drug effects using real-time, in vitro measurement of oxygen consumption rate. Toxicological Sciences 148: p. 594-602, 2015), and on the basis of stable OCR we concluded that the tissue remained viable. However, it is possible that the membranes of some of the tissue become permeabilized, which would affect the responses to test compounds. We considered this issue from two perspectives: 1. whether the established models that we used to test the BaroFuse were prone to high cell permeability; and 2. whether loading and maintaining the tissue models in the fluidics system increased permeability. We measured the OCR responses to ADP of islets and retina within the fluidics system. Effects were observable but small. However, these results are not definitive, because it was difficult to know what the response of permeabilized tissue would be (and permeabilizing tissue slices was difficult). We then used propidium iodide staining to visualize and quantify the level of permeability. In islets, the fluorescence before and after perifusion was negligible compared with that in islets permeabilized by H2O2 treatment (see below).

      Fig. 1. Staining of isolated rat islets with propidium iodide, an indicator of cell membrane integrity. Islets were stained either before or after a 3-hour perifusion. As a positive control for PI staining, islets were treated with 500 µM H2O2 for 30 minutes and incubated overnight. Each data point is the mean ± SE (n = 3).

      There was some fluorescence in retina and liver, but it was difficult to interpret these data in terms of the fraction of permeabilized tissue, because dye close to the tissue surface is preferentially imaged. We therefore assessed the amount of permeabilized tissue in retina and liver by comparing the uptake of 3H-H2O with that of the extracellular marker 14C-sucrose.

      Fig. 2. Fraction of tissue water space that is accessible to the extracellular marker sucrose. Left: mouse retina. Right: rat liver slice. Each data point is the mean ± SE (n = 3).

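      Extracellular water in liver and retina is well established to be about 25% of tissue water, close to the measured volume of distribution of sucrose. Because sucrose would also enter permeabilized cells, extensive permeabilization would have pushed the sucrose space well above this value. Thus, although we cannot rule out that a small percentage of cells are permeabilized, the vast majority are not.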
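      The dual-label comparison can be sketched as a ratio of distribution volumes. Each isotope's tissue counts divided by its counts per µL of medium gives the volume it occupies, and the 14C-sucrose volume divided by the 3H-H2O volume is the accessible fraction plotted in Fig. 2. The function name and example counts are hypothetical, and the counts are assumed to be already corrected for 3H/14C channel spillover.

```python
def sucrose_space_fraction(tissue_14c, medium_14c_per_ul, tissue_3h, medium_3h_per_ul):
    """Fraction of total tissue water (3H-H2O space) accessible to 14C-sucrose.

    Each distribution volume (uL) = tissue counts / medium counts per uL.
    """
    sucrose_space_ul = tissue_14c / medium_14c_per_ul
    water_space_ul = tissue_3h / medium_3h_per_ul
    return sucrose_space_ul / water_space_ul

# Hypothetical counts (dpm): sucrose space 0.5 uL, water space 2.0 uL.
fraction = sucrose_space_fraction(500.0, 1000.0, 4000.0, 2000.0)
```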

      Additional comments are detailed below:

      -The experiments with H2S are particularly interesting, as this system does seem well suited to investigate the metabolic effects of H2S.

      Thanks! We are excited by the potential for this method to assess the effects of H2S and other trace gases.

      -The authors state the transient rise in O2 consumption was surprising; however, accumulation of succinate during ischemia and rapid oxidation upon reperfusion has been previously demonstrated (PMID: 32863205).

      This is an interesting paper describing findings on the role of succinate in supplying fuel that could drive the transient changes in O2 consumption observed following hypoxia. It would be interesting to perform our hypoxia-reoxygenation experiment in the absence and presence of cell-permeable malonate to see whether the spike in O2 consumption following reoxygenation is abolished by the drug. We have removed the word "surprising" and cited this paper.

      -In the paper, Zaprinast was used to block pyruvate uptake. However, the rationale to use this compound, as opposed to the more specific MPC inhibitor UK5099 is unclear.

      We could have used UK5099, but we had used Zaprinast in past studies (Du J, Cleghorn WM, Contreras L, Lindsay K, Rountree AM, Chertov AO, Turner SJ, Sahaboglu A, Linton J, Sadilek M, Satrústegui I, Sweet IR, Paquet-Durand F, Hurley JB. Inhibition of mitochondrial pyruvate transport by Zaprinast causes massive accumulation of aspartate at the expense of glutamate in retinas. J Biol. Chem, 288:36129-40, 2013) and therefore knew that, in our hands, it blocked mitochondrial pyruvate uptake and would be a good test of the rapid transfer of pyruvate across the plasma membrane.

      -Throughout the paper, the authors list 'COVID-19' as a potential application. It is not clear how this technology could be used in the context of COVID-19.

      Reference to COVID-19 has been removed.

    1. Author Response

      Reviewer #1 (Public Review):

      Wang and Dudko derive analytical equations for one special case of a model of Ca-dependent vesicle fusion, in the attempt to find a "general theory" of synaptic transmission. They use a model with 2 kinetically distinct fast and slow pools (equation 1).

      Critique

      1) Overall, the analytical approach applied here remains limited to the quite arbitrarily chosen 2-pool model. Thus, while the authors are able to recapitulate the kinetics of transmitter release under a series of defined intracellular Ca-concentration steps, [Ca]i (see Fig. 2B; data from Woelfel et al. 2007 J. Neuroscience), this is nevertheless not surprising because the data by Woelfel et al. was originally also fit with a 2-pool model. More importantly, the 2-pool model is valid for describing release kinetics at high [Ca]i, but it cannot account for other important phenomena of synaptic transmission like e.g. spontaneous and asynchronous release which happen at lower [Ca]i, with different Ca cooperativity (Lou et al., 2005). Along the same lines, the derivations of the equations by Wang and Dudko are not valid in the range of low [Ca]i below about 1 micromolar (see "private recommendations" for details). This, however, limits the applicability of the model to AP-driven transmitter release, and it shows that based on one specific arbitrarily chosen model (here: the 2-pool model), one cannot claim to build a realistic and full "theory" for synaptic transmission.

      Our two-pool description is far from being “arbitrarily chosen”. It is based on experimental facts established by multiple independent laboratories: namely, the two distinct vesicle fusion kinetics that are observed, arising from the readily releasable and reserve pools in vivo and from two dominant vesicle morphologies in vitro. The two-pool picture has been confirmed and successfully used in numerous experimental papers. That being said, our two-pool description refers to a more general notion of separation of timescales and is thus more flexible than a literal interpretation might suggest.
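      The separation of timescales behind the two-pool picture can be illustrated with a minimal sketch. It assumes the simplest reading of a two-pool scheme: two independent pools, each depleted by first-order kinetics at a constant rate (as after a fixed [Ca2+] step), with no replenishment. This is not necessarily the form of Eq. 1 in the manuscript, and the function name and parameter values below are hypothetical, chosen only for illustration.

```python
import math


def cumulative_release(t, n_fast, k_fast, n_slow, k_slow):
    """Cumulative number of vesicles released by time t (hypothetical helper).

    Assumes two independent pools, each depleted by first-order kinetics with
    no replenishment:
        N(t) = n_fast * (1 - exp(-k_fast * t)) + n_slow * (1 - exp(-k_slow * t))
    """
    fast = n_fast * (1.0 - math.exp(-k_fast * t))
    slow = n_slow * (1.0 - math.exp(-k_slow * t))
    return fast + slow


# Illustrative, made-up parameters: the fast pool dominates early release,
# while the slow pool contributes mainly on longer timescales.
n_fast, k_fast = 2000.0, 5.0   # vesicles, rate in 1/ms
n_slow, k_slow = 2000.0, 0.05  # vesicles, rate in 1/ms

for t in (0.0, 1.0, 10.0, 100.0):
    print(t, round(cumulative_release(t, n_fast, k_fast, n_slow, k_slow)))
```

      With rates two orders of magnitude apart, the cumulative curve shows a fast rise followed by a slow component, which is the kind of timescale separation the two-pool description captures.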

      The data from Woelfel et al. 2007 J. Neuroscience, while of excellent quality, are not the only measured kinetics of action-potential-triggered vesicle fusion that our theory has been able to recapitulate (see the other experimental data in Figs. 2 and 3 of the manuscript). The theory also recapitulates the kinetic measurements from fifteen other independent experimental studies, on ten different types of synapses. The dynamic range (peak release rate) of these synapses varies by 10 orders of magnitude, and the Ca2+ concentrations span more than 3 orders of magnitude. Our work recapitulates these 16 datasets not through 16 different ad-hoc models but through a single, fully analytically solved, theoretical framework. Importantly, beyond recapitulating the existing data, our analytically tractable theory enables one to extract the unique sets of microscopic parameters for particular synapses, such as the activation energies and kinetic rates of their synaptic machinery, the sizes of the vesicle pools and the critical number of SNAREs. We verify that these predictions from our theory have reasonable values for each of the datasets; this is an additional, non-trivial check of our theory. The fact that our theory reproduces observations on such strikingly diverse systems, and has such a degree of predictive power, cannot be dismissed as an artifact or coincidence. We are not aware of any other theory, or fitting model, of comparable generality and ability to generate concrete predictions.

      Reviewer #1 is mistaken in stating that the derivations of our equations are not valid below 1 micromolar Ca2+ concentrations. It is evident from Figure R1 below (Fig. 2 in the revised manuscript) that the theory performs flawlessly at concentrations as low as 0.1 µM. There are indeed non-linear effects at ultra-low Ca2+ concentrations that are not displayed by the experimental data in Fig. R1. Our theory is also applicable in that regime: one simply needs to include a second coordinate (in addition to the number of bound Ca2+ ions, Q‡) to account for the multidimensionality of the free energy landscape, analogous to the calculation of rate constants for multidimensional activated rate processes in chemical physics. This illustrates just one of the many ways in which our theory will enable detailed studies of mechanistic aspects of synaptic transmission.

      With further regard to generality, as stated in our Abstract, this paper is concerned with providing a physical theory to describe “rapid and precise neuronal communication” enabled by “a highly synchronous release” of neurotransmitters. Typically, more than 90% of neurotransmitters are released synchronously during the action potential. By applying our theory to each of the multiple Ca2+ sensors, one will be able to cover the remaining <10% of the neurotransmitters and thus describe spontaneous, asynchronous and synchronous release simultaneously. While detailed studies of these effects are clearly beyond the scope of this work, our theory opens the door to such studies by providing a foundation in the form of a conceptual, analytically tractable framework.

      2) In their derivations, Wang and Dudko collapse the intracellular Ca-concentration [Ca]i, a parameter directly quantified in the several original experiments that went into Fig. 2A, into a dimensionless relative [Ca]i "c" (see equation 7). Similarly, the release rates are collapsed into a dimensionless quantity. With these normalizations, Ca-dependent transmitter release measured in several preparations seems to fall onto a single theoretical prediction (Fig. 2A). The deeper meaning behind this equalization of the data was unclear, beyond demonstrating that the data from these different experiments can in general be described with a two-pool model, which is at the core of the dimensionless equations. One issue might be that many of the original datasets used here derive from the same preparation (the calyx of Held), and therefore the data might not scatter strongly between studies. The authors could clarify this by also plotting the data from all studies on the non-normalized [Ca]i axis for comparison. Furthermore, it would be useful to include data from other preparations, such as the inner hair cells (Beutner et al. 2001 Neuron; their Fig. 3), which likely have a lower Ca-sensitivity, i.e. are right-shifted compared to the calyx (see discussion in Woelfel & Schneggenburger 2003 J. Neuroscience). Thus, it is unclear why normalization of [Ca]i to "c" should be an advantage, because differences in the intracellular Ca sensitivity of vesicle fusion exist between synapses (see above), and likely represent important physiological differences between secretory systems.

      We thank the Reviewer for challenging our work with the hypothesis that the demonstrated universal scaling of the experimental data could in fact be an artefact caused by pre-selecting data from the same preparation; addressing this hypothesis is indeed a compelling test of the true limits of generality of our theory. Below we carry out this test. We implemented both suggestions of the Reviewer: (i) we added datasets on markedly different synaptic preparations, including the inner hair cells suggested by the Reviewer as well as retinal bipolar cells, hippocampal mossy fibers, cerebellar basket cells, chromaffin cells and insulin-secreting cells, plus additional calyx of Held data from multiple laboratories, and (ii) we plotted the data on a non-normalized [Ca2+] axis to reveal the full extent of scatter among the datasets. The resulting plot (Fig. R1 below) speaks for itself: the in vivo release rates span 4 orders of magnitude at low [Ca2+] and 6 orders of magnitude at high [Ca2+], and the release rates from in vivo and in vitro data differ by 10 orders of magnitude. This scatter of 4-10 orders of magnitude reflects the vastly different sensitivities to [Ca2+] among synaptic preparations (Fig. R1, left). Yet all these data collapse beautifully onto the master curve established by our theory (Fig. R1, right).

      Fig. R1. Despite 10 orders of magnitude variation in the release rate of different synaptic preparations and a more than 3 orders of magnitude range of calcium concentration (left), the data collapse onto a universal curve predicted by the theory (right). The collapse indicates that the established scaling (Eq. 7) is universal across different synapses. The distinct sets of parameters for individual synapses (Appendix 3 Table 2) demonstrate the predictive power of the theory as a tool for extracting the unique properties of each synapse from experimental data.

      What the Reviewer refers to as “the equalization of the data” is known in statistical physics as universality. The deeper meaning of a universal scaling is its indication that the observed phenomena realized in seemingly unrelated systems are in fact governed by common physical principles. The collapse of the data onto the universal curve in Fig. R1 is a demonstration that the present theory has uncovered, quantitatively, unifying physical principles underneath the striking diversity and bewildering complexity of chemical synapses. The Referee is of course correct that the differences in [Ca2+] sensitivities among synapses likely represent important physiological differences between distinct synapses and distinct secretory systems. The present theory does not negate these differences, but it in fact allows one to quantify these differences through the unique sets of extracted parameters for individual synapses (see Appendix 3 Table 2). We are not aware of any other theory that has demonstrated universality in synaptic transmission through a simple, single scaling relation across 10 orders of magnitude in dynamic range and at the same time allowed the extraction of the microscopic parameters that are unique for the individual synapses and thus reflect the diversity of their synaptic machinery. We included Fig. R1 shown here in the revised manuscript (Figure 2).

      3) Finally, the authors use their model to derive the number of SNARE proteins necessary for vesicle fusion, and they arrive at the quite strong conclusion that N = 2 SNAREs are required. Nevertheless, this estimate does not fit with the n = 4-5 Ca2+ ions consistently found by the original studies of Fig. 2A. The Ca-sensitivity at the calyx of Held, and the steepness of the release rate versus [Ca]i relation, are determined by Ca-binding to Synaptotagmin-2 (the specific Ca sensor isoform found at the calyx synapse), as has been shown in molecular studies at the calyx synapse (see Sun et al. 2007 Nature; Kochubey & Schneggenburger 2011 Neuron). Furthermore, in other secretory cells, the number of SNARE proteins has been estimated to be ≥ 3 (Mohrmann et al., Science 2010).

      The Reviewer is incorrect in claiming that there is any discrepancy here. The number of SNAREs N and the number of Ca2+ ions Q‡ extracted from the fit to our theory are in good agreement with the findings of the studies mentioned by the Reviewer. To clarify, the parameter Q‡ is the number of Ca2+ ions bound to a SNARE at the transition state (not the final state) of the free energy landscape of a SNARE complex. Appendix 3 Table 2 shows that, for all synaptic preparations, the extracted transition-state values are Q‡ < 4-5, which is consistent with n = 4-5 at the final state. In addition, our theory enables one to extract the key energetic parameter that governs synaptic vesicle fusion: the activation free energy barrier ΔG‡ of the SNARE conformational transition (in the range 8-34 kBT for different synaptic preparations; see Appendix 3 Table 2), which, to our knowledge, could not previously be extracted from these experiments.

      The specific value N=2 was extracted from a particular data set for Calyx of Held (Woelfel et al 2007), for which the temporal curves of cumulative release at different Ca2+ concentrations were available. It is quite possible that the value of N will be different for some other synapses. As we emphasize in the manuscript (see Discussion), the present theory does not declare the same value of N for all types of synapses; the power of the theory lies in providing a fitting tool for extracting this value for a system of interest.

      Taken together, the derivation of the analytical equations for the kinetic scheme of a 2-pool model is mathematically interesting, and the scholarly derived equations are trustworthy. Nevertheless, the derived analytical model in fact captures only a specific stage of synaptic transmission focusing on Ca-dependent fusion of vesicles from two pools at [Ca]i >1 microM. Other important processes and mechanistic components (e.g. spontaneous, asynchronous release, Ca-dependent pool replenishment, postsynaptic factors) are either over-simplified or remained out of the scope of the theory. Therefore, the paper is far from providing a general "theory for synaptic transmission", as the title promises.

      We appreciate that the Reviewer sees our analytical derivations as being mathematically interesting, scholarly derived, and trustworthy. We believe that we have convincingly refuted the Reviewer’s criticisms regarding perceived limitations. We have shown that our universal scaling and collapse is not limited to high calcium concentrations, and have presented checks using data from vastly different synaptic preparations. As noted above, the generality of a theory is determined not by the amount of details packed in it but by the ability of the theory to reproduce observations and generate predictions regarding the phenomenon of interest (here: rapid and precise neuronal communication) while containing as few details as possible. Our theory accomplishes just that; it delivers precisely what our title promises.

      Reviewer #2 (Public Review):

      The present MS describes an effort to create a general mathematical model of synaptic neurotransmission. The authors invested great effort to create a complex model of the presynaptic mechanisms, but their treatment of the postsynaptic mechanisms is greatly oversimplified. The authors claim that their model is consistent with many in vivo and in vitro experimental data, but this might be true only for a small subselection of experimental papers (they regularly cite 7 experimental papers in the MS!). The authors also indicate that their modeling has a realistic foundation, namely that they can relate some parameters in their equations to molecules/molecular mechanisms. One example is the parameter N, which they claim indicates the number of SNARE complexes required for fusion. The reviewer finds this rather misleading, because it implies that there is a parameter for complexin, Rim1, Rim-BP, Munc13-1 etc. The equations clearly cannot formulate and reflect the diversity due to different isoforms of even the above-mentioned key presynaptic molecules.

      We appreciate that the Reviewer found 7 different experimental papers, covering different synapses and different experimental setups, to be “a small subselection”. We believe that Fig. R1 above (response to Reviewer #1, point 2), which uses 16 different experimental papers, leaves no further doubt that the claims about the consistency between theory and data are fully justified. Despite up to 10 orders of magnitude variation in the release rate of different synaptic preparations and a more than 3 orders of magnitude range of calcium concentrations (Fig. R1, left), all the data collapse onto a universal curve predicted by our theory (Fig. R1, right). These data represent different systems, from the central nervous system to the secretory system, and come from in vivo and in vitro experiments. The data we have used cover the measurements on all synaptic systems that we could find in the literature on action-potential-driven neurotransmitter release. If the Reviewer is aware of any existing data on other synaptic systems that we might have missed, we would welcome the opportunity to apply the theory to those data as well.

      The diversity of the molecular components in different synapses is captured in our theory through different values of the microscopic parameters ΔG‡, Q‡ and k₀. These parameters describe, respectively, the activation energy barrier, the number of bound Ca2+ ions, and the intrinsic rate of the conformational transition of the SNARE complexes that drive synaptic vesicle fusion in a given synapse. Different isoforms of the individual components of SNARE complexes and scaffold proteins, including the proteins mentioned by the Reviewer, will be reflected in different values of ΔG‡, Q‡ and k₀ for specific synaptic preparations, as can be seen in Appendix 3 Table 2 of the manuscript. These parameters capture the energetic and kinetic properties of the synaptic fusion machinery as a complex rather than as a collection of isolated molecules. Because the molecular components within a SNARE complex act collectively (hence the name “complex”) to drive vesicle fusion, it is natural (and indeed fortunate) that the predictive power of the theory can be preserved with only a few key parameters of the molecular machinery, as opposed to requiring a long list of parameters for every specific isoform of each of the many individual molecular components.

    1. Author Response

      Reviewer #1 (Public Review):

      The authors present a strong set of experiments to uncover what type of role non-mutant stromal cells might be playing in the development of VM and AST, two vascular lesions that share some similarities.

      Questions about experimental design.

      1) For quantification of gene expression in VM and AST specimens in Figure 2, the methods say qPCR data were normalized to housekeeping genes, but it would be helpful to normalize to endothelial content. It might be that increased TGFa is due to increased endothelium.

      We thank the Reviewer for this excellent suggestion. We have now added these new data, with TGFA mRNA normalized to the endothelial marker PECAM-1/CD31 mRNA as suggested. A trend towards increased expression of TGFA mRNA was detected in VM/AST specimens in comparison to the control group. We also show in the manuscript that, besides CD31-positive vascular structures, TGFA is expressed in intervascular areas, i.e. between the vessels, in the patients’ lesions (Fig. 2) and in lesion-derived CD31-negative intervascular stromal cells. Together, these data demonstrate that (i) TGFA is also expressed in cell types other than endothelial cells and (ii) the increased expression of TGFA in lesion samples is not due solely to increased vasculature/endothelium in the patient samples.

      The new RT-qPCR data have now been added to the manuscript as a new Fig. 2 - figure supplement 1.
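      Normalizing TGFA to an endothelial marker means expressing it relative to PECAM-1/CD31 within each sample before comparing lesions with controls. The response does not state the quantification method, so the following is only a minimal sketch assuming the common 2^-ΔΔCt (Livak) approach with roughly 100% amplification efficiency for both assays; the function names and Ct values are hypothetical.

```python
def relative_expression(ct_target, ct_reference):
    """Target expression normalized to a reference gene within one sample.

    Uses the 2^-dCt convention, dCt = Ct_target - Ct_reference, which assumes
    ~100% amplification efficiency for both assays.
    """
    return 2.0 ** -(ct_target - ct_reference)


def fold_change(sample, control):
    """2^-ddCt fold change of a sample relative to a control.

    Each argument is a (ct_target, ct_reference) tuple, e.g. (TGFA, PECAM1).
    """
    return relative_expression(*sample) / relative_expression(*control)


# Hypothetical Ct values (TGFA, PECAM1) for one lesion and one control sample.
lesion = (26.0, 22.0)
control = (28.0, 22.5)
print(fold_change(lesion, control))
```

      Because the reference here is an endothelial marker rather than a housekeeping gene, a higher normalized value cannot be explained simply by a larger endothelial fraction in the sample.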

      2) The mutant allelic frequency for the HUVEC-PIK3CA WT versus HUVEC-PIK3CA H1047R should be provided. This is critically needed for the interpretation of the results.

      Thank you for this valuable comment. To confirm that PIK3CA H1047R is still present in transduced HUVECs at the end-point of the mouse xenograft experiment, we performed a new ddPCR analysis detecting the fractional abundance of PIK3CA p.H1047R in the Matrigel plug samples. In these new data, the mean fractional abundance of PIK3CA p.H1047R in plugs containing fibroblasts and PIK3CA H1047R ECs was 27.1 % (range 26.5-27.8 %; n=2 mice, in duplicates). This corresponds to ~54 % of PIK3CA p.H1047R mutation-positive cells in the plug, assuming a single copy of the mutation in each cell. No mutant signal was detected in the control plugs containing fibroblasts and PIK3CA WT ECs, as all of these cells express the wild-type form of the PIK3CA gene. Please see Author Response Image 1 for representative 2D amplification plots of the mutation analysis. Fractional abundances of PIK3CA mutations in the patient tissue samples and patient-derived CD31+ cells can also be seen in Table 1 and were in the range of 5-12 % (whole tissue) and 44-51 % (EC fraction).
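      The step from 27.1 % fractional abundance to ~54 % mutant cells follows from the stated assumption of a single mutant copy per cell. A minimal sketch, assuming every mutant cell carries one mutant and one wild-type PIK3CA copy and that all cells contribute equally to the copy count (the function name is hypothetical):

```python
def mutant_cell_fraction(fractional_abundance):
    """Estimate the fraction of mutation-carrying cells from ddPCR fractional abundance.

    Assumes each mutant cell has one mutant copy out of two PIK3CA copies and
    every cell contributes the same number of copies, so
    mutant cells = 2 * (mutant copies / total copies).
    """
    if not 0.0 <= fractional_abundance <= 0.5:
        raise ValueError("the one-mutant-copy model only covers 0-50 % abundance")
    return 2.0 * fractional_abundance


# Mean and range of fractional abundance reported for the xenograft plugs.
for fa in (0.265, 0.271, 0.278):
    print(f"FA {fa:.1%} -> ~{mutant_cell_fraction(fa):.1%} mutant cells")
```

      Under this assumption, the reported range of 26.5-27.8 % corresponds to roughly 53-56 % mutant cells.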

      3) From Figure 5, it appears that the human primary fibroblasts are not required for the mutant ECs to form perfused vessels (panel H).

      We thank the Reviewer for the comment and agree that, based on our H&E staining and erythrocyte analysis, perfused vessels are evident in PIK3CA mutant plugs containing ECs with fibroblasts but also in plugs containing ECs alone. This was expected, as the PIK3CA mutation in ECs alone has been shown to drive venous malformation. However, prior to our study the role of fibroblasts in PIK3CA-driven lesions had not been studied. To better understand the role of fibroblasts in lesion formation, we have now added new data to the manuscript, including example images of the PIK3CA H1047R plugs with or without fibroblasts and a new quantitation of their erythrocyte content. Please see Author Response Image 2. Our data demonstrate that lesions with fibroblasts have significantly (i) more CD31-positive vascular structures (Fig. 5E-G), (ii) larger lumens (Fig. 5D-F) and (iii) more erythrocyte-containing regions, indicative of perfused vessels (new Fig. 5H), in comparison to plugs containing ECs alone. This implies that fibroblasts further promote PIK3CA-driven EC lesion formation.

      Author Response Image 2. Vascular structures formed with PIK3CA H1047R ECs alone and PIK3CA H1047R ECs + FBs in mouse xenograft plugs. In the figure panel, H&E staining on each individual plug in these groups is presented. Equal size close-up images were taken from the middle of each plug covering > 50% of plug area (scale bar 250µm). More erythrocytes (red) are seen in the plugs with fibroblasts in comparison to ECs alone. Scanned images of the H&E stained whole tissue sections can be seen in the Fig. 5 – source file.

      A new quantitative analysis of the erythrocyte-positive area relative to the whole plug area was additionally performed using the SproutAngio quantification tool. The analysis was done in a blinded manner and showed a significantly increased erythrocyte content in the plugs containing PIK3CA H1047R ECs and fibroblasts (in comparison to ECs alone). A description of the analysis has now been added to the manuscript (p. 42, rows 839-843). Figures 5G and 5H in the manuscript were updated to show statistics and automated intensity-based quantification of the erythrocyte-positive area per plug instead of erythrocyte scoring (scale 0-3).

      Is it possible that TGFa from the ECs is sufficient to drive vascular malformation?

      Mutations in genes such as PIK3CA, TEK and KRAS have been shown to drive the formation of vascular anomalies. It is therefore unlikely that a single growth factor, such as increased TGFA expression, would drive this process alone. That being said, our data show that TGFA is able to regulate the proliferation of PIK3CA-mutated ECs via a secondary mechanism (Fig. 4F), and we show that inhibition of the EGFR pathway reduces PIK3CA-driven lesion growth in mice (Fig. 7). As our bulk RNA-sequencing data from patient-derived cells also showed expression of other growth factors in lesion ECs (Table 3), it is likely that multiple angiogenic growth factors are involved in lesion formation, similarly to tumors, and that their expression is driven primarily by the mutated cells and secondarily by cell-cell crosstalk with other lesion cell types. Thus, targeting multiple signalling pathways could be a beneficial treatment strategy in the future.

      Reviewer #2 (Public Review):

      In this manuscript, Ilmonen H. et al. explored potential crosstalk between endothelial cells and fibroblasts in the context of sporadic vascular malformations (venous malformation and angiomatosis of soft tissue). With a high level of evidence, they found that mutated endothelial cells secrete TGFA, which activates surrounding fibroblasts, leading in turn to VEGFA secretion that stimulates endothelial cell sprouting and vascular malformation development. The experiments are well designed and support their hypothesis. Some controls are missing, particularly in Fig. 2. Indeed, it is mandatory to provide data from healthy skin biopsies (which are available in many laboratories): TGFa, CD31 and P-EGFR staining.

      We thank the Reviewer for the comments. Although VM commonly presents in skin, in this work we focused solely on intramuscular and subcutaneous AST and VM patient samples and excluded samples containing skin from this study. We performed TGFA immunostainings on healthy skeletal muscle, which can be seen in Figure 2 – figure supplement 2B. CD31 staining of vessels in healthy skeletal muscle near the resection margin can be seen in Figure 1B. The tissue locations of all VM and AST samples in this study are listed below:

      • Intramuscular, 42.1 % of lesions (n=16)

      • Intramuscular and subcutaneous, 21.1 % of lesions (n=8)

      • Intramuscular, subcutaneous and synovial membrane, 5.3 % of lesions (n=2)

      • Intramuscular and synovial membrane, 2.6 % of lesions (n=1)

      • Subcutaneous and synovial membrane, 2.6 % of lesions (n=1)

      • Subcutaneous only, 26.3 % of lesions (n=10)

      • Skin, none of the lesions

    1. Author Response:

      Reviewer #1 (Public Review):

      The key question that the authors were addressing was how ethnicity differentially affects the microbiota of subjects living in a particular area (in this case East Asians and Caucasians living in San Francisco who have been enrolled in the Inflammation, Diabetes, Ethnicity and Obesity (IDEO) cohort, although inflammatory disease was apparently excluded in these subjects).

      The existence of differences between populations allows potential discrimination of the underlying factors, such as host genetics, diet, lifestyle, physiological parameters, body habitus or other environmental influences. In this case, body habitus has been selected as a stratification factor between the two ethnicities. Immigration potentially allows environmental and host genetic influences to be distinguished.

      The strength of the study is in the level of robust analysis of the microbiotas by a very experienced group of researchers, distinguishing the microbiota differences, especially in lean subject, with analysis of associations that may be driving the differences. It is interesting that diet is not one of the apparent associations in this study, yet the relationship of microbiota diversity to body habitus is strong in Caucasian subjects. These associations cannot easily be extrapolated to causation or mechanism - a fact well recognized in the paper - but remain important observations that rationalize in vivo modeling with experimental animals or in vitro analyses of microbial interactions between different taxa simulating the context of differences in the intestinal milieu. The paper includes work showing that differences of the microbiota can be recapitulated after transfer to germ-free mice, at least over the short term: this is important to provide tools to model the reasons for differences in consortial composition.

      A very large amount of work was required to assemble the samples and the clinical phenotypic metadata, making the data an important and definitive contribution for the subjects studied. Of course, this is one sample of extremely variable human conditions and lifestyles that will help build the overall picture of how differences in our genetics and environment shape our intestinal microbiota.

      We appreciate the reviewer's positive summary of our manuscript and agree with the reviewer’s assessment of the need for both mechanistic follow-on studies and extensions to larger and more diverse cohorts.

      Reviewer #2 (Public Review):

      The study's primary aims are to test for differences in the microbiome between self-identified East Asian and White subjects from the San Francisco area in the new IDEO cohort. The study builds on a growing literature that describes variations among ethnic groups. The major conclusion, to "emphasize the utility of studying diverse ethnic groups", is not novel to the literature.

      It was not our intention to imply that our study is novel in studying two distinct ethnic groups, but rather to emphasize that differences exist between ethnicities with regard to the gut microbiome and to provide a systematic analysis of these differences, including gnotobiotic mouse models, along a key health disparity in Asian Americans. We include references to prior examples of this work in our introduction (including several in our introductory paragraph). We have modified our abstract to clarify this point further:

      “Taken together, our findings add to the growing body of literature describing variation between ethnicities and provide a starting point for defining the mechanisms through which the microbiome may shape disparate health outcomes in East Asians.”

      Overall, the strength of the results is that they confirm patterns from different cohorts/studies and demonstrate that ethnicity-related differences are common. The results are subject to sample size concerns that may underpin some of the conflicting or non-significant results. For instance, there is no overlap in the highlighted species-level taxonomic differences between the 16S and metagenomic analyses, which precludes a clear interpretation of those differences and of whether these taxa should be highlighted in the abstract; the AUC values for the random forest modelling are low; and the correlation between BMI and East Asian subjects in Fig. 4A is not significant, although there may be a correlation. While a minor point, this highlights the sample size issue: the range of variation in East Asian subjects is not as substantial as in White subjects, because there are fewer East Asian data points above a BMI of 30 (~N=5) than White ones (~N=11).

      We agree that our study was limited by sample size and that future studies increasing sample size would be valuable to assess the intersection of metabolic health in colocalized EA and W subjects. We include this in our discussion:

      “Due to the investment of resources into ensuring a high level of phenotypic information on each cohort member, and due to its restricted geographical catchment area, the IDEO cohort was relatively small at the time of this analysis (n=46 individuals). This study only focused on two of the major ethnicities in the San Francisco Bay Area; as IDEO continues to expand and diversify its membership, we hope to study a sufficient number of participants from other ethnic groups in the future.”

      The microbiome transfers from humans to mice also demonstrate that certain features of interpersonal or ethnic-related differences can be established in mice. This is useful for future studies, but it is not unexpected in and of itself given the robustness of transferring microbiome differences in other human-to-mouse studies. If the phenotype data were more compelling, then the utility of these transfers could be valuable.

      We respectfully disagree with this point. To our knowledge, this is the first study demonstrating that ethnicity-associated differences in the gut microbiota are stable following transplantation, which is certainly not guaranteed given the marked and currently unpredictable variations between donor and recipient microbiotas shown here and in prior studies by us (Nayak et al., 2021; Turnbaugh et al., 2009b) and others (Walter et al., 2020).

      We state this rationale in our results section:

      “Taken together, our results support the hypothesis that there are stable ethnicity-associated signatures within the gut microbiota of lean EA vs. W individuals that are independent of diet. To experimentally test this hypothesis, we transplanted the gut microbiotas of two representative lean W and lean EA individuals into germ-free male C57BL/6J mice…Next, we sought to assess the reproducibility of these findings across multiple donors and in the context of a distinctive dietary pressure. We fed 20 germ-free male mice a high-fat, high-sugar (HFHS) diet for 4 weeks prior to colonization with a gut microbiota from one of 5 W and 5 EA donors....”

      Furthermore, while the phenotypic data may not be as dramatic as the reviewer had hoped, this is to our knowledge the first demonstration that ethnicity-associated differences in the gut microbiota play a causal role in host phenotypes, as highlighted in our discussion:

      “Our results in humans and mouse models support the broad potential for downstream consequences of ethnicity-associated differences in the gut microbiome for metabolic syndrome and potentially other disease areas. However, the causal relationships and how they can be understood in the context of the broader differences in host phenotype between ethnicities require further study.”

      However, in the current state, I am concerned with the experimental design, since the LFPP experiments used N=1 donor per ethnicity to establish the mouse colonies and are consequently confounded by pseudo-replication, with all recipient mice derived from one donor of each ethnicity. This concern is relevant to interpreting the results in terms of interpersonal or interethnic variation. Are the phenotypic differences due to individual differences or ethnic differences? It's not clear.

      We presented our data in summary form, integrating the results from 3 independent experiments across two figures. To account for pseudoreplication, as the reviewer suggests, we have restricted the permutational space to account for one donor being transplanted into multiple recipient mice, using the parameters outlined in the adonis software package. In a pooled analysis of our 3 separate experiments, our results are statistically significant, as we now mention in the revised text:

      “In a pooled analysis of all gnotobiotic experiments accounting for one donor for multiple recipient mice, ethnicity and diet were both significantly associated with variations in the gut microbiota (Fig. S9), consistent with the extensive published data demonstrating the rapid and reproducible impact of a HFHS diet on the mouse and human gut microbiota (Bisanz et al., 2019).”

      Figure S9. Combined analysis of recipient mice reveals significant associations with donor ethnicity and recipient diet. A PhILR PCoA is plotted based on 16S-Seq data from all gnotobiotic experiments. Individual mice are colored by (A) donor ethnicity or (B) the recipient’s diet. Both ethnicity and diet were statistically significant contributors to variance (ADONIS p-values and estimated variance displayed using blocks restricted by donor identifiers to account for one donor going to multiple recipient mice). We also observed a trend for interaction between diet and ethnicity in this model (p=0.068, R2=0.047, ADONIS).
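      To make the pseudoreplication point concrete: one way to test a donor-level factor such as ethnicity is to permute group labels across whole donors, so that all recipient mice colonized from one donor always share a label. The sketch below is a minimal, standard-library illustration of that idea, assuming Euclidean distances on already-transformed coordinates (such as PhILR balances) and a one-way design. It is not the adonis call used in the analysis above, and the function names are hypothetical.

```python
import itertools
import random


def squared_distances(points):
    """Squared Euclidean distances between all pairs of samples."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points] for p in points]


def pseudo_f(d2, labels):
    """One-way PERMANOVA pseudo-F computed from squared distances."""
    n = len(labels)
    groups = sorted(set(labels))
    ss_total = sum(d2[i][j] for i, j in itertools.combinations(range(n), 2)) / n
    ss_within = 0.0
    for g in groups:
        idx = [i for i in range(n) if labels[i] == g]
        ss_within += sum(d2[i][j] for i, j in itertools.combinations(idx, 2)) / len(idx)
    a = len(groups)
    return ((ss_total - ss_within) / (a - 1)) / (ss_within / (n - a))


def donor_level_permanova(points, mouse_donor, donor_group, n_perm=999, seed=0):
    """Test a donor-level factor by shuffling group labels across whole donors.

    mouse_donor: donor ID for each recipient mouse (same order as points).
    donor_group: dict mapping each donor ID to its group (e.g. "EA" or "W").
    All mice from one donor always move together, so recipient mice are not
    treated as independent replicates of the donor-level factor.
    """
    rng = random.Random(seed)
    d2 = squared_distances(points)
    observed = pseudo_f(d2, [donor_group[d] for d in mouse_donor])
    donors = sorted(donor_group)
    pool = [donor_group[d] for d in donors]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pool)
        shuffled = dict(zip(donors, pool))
        f_perm = pseudo_f(d2, [shuffled[d] for d in mouse_donor])
        if f_perm >= observed - 1e-12:  # tolerance for floating-point ties
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

      A side effect of permuting whole donors is that the number of distinct donor splits bounds the smallest attainable p-value: with two donors per group there are only three distinct ways to split four donors into two pairs, so p cannot fall much below 1/3, whereas five donors per group allow 126 distinct splits.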

      The HFHS experiment also used N=5 donors, which somewhat mitigates these concerns, but mixed sexes were used there, and there can be sex-specific differences in the human microbiome.

      Our study was designed to evaluate ethnicity and metabolic health. As we report in our original and updated analysis, we found no significant associations between the gut microbiota and biological sex (Figs. 2E and S4) in the IDEO cohort, perhaps due to the small effect size of sex reported in prior studies by other groups (Arumugam et al., 2011; Ding and Schloss, 2014; Schnorr et al., 2014; Zhang et al., 2021) coupled to the limited size of the current IDEO cohort.

      The Turnbaugh and Koliwad labs use mixed sexes as donors for studies in conventionally raised and gnotobiotic mice due to our active funding from the NIH, which has clear guidelines meant to prevent continued discrimination against studies in females. The following link has additional information for your consideration: https://orwh.od.nih.gov/sex-gender/nih-policy-sex-biological-variable.

      Importantly, our study was not confounded by sex due to the use of similar numbers of male and female donors (2 males and 2 females in the LFPP experiments and 3 females and 2 males for both ethnicities in the HFHS experiment). All of our recipient mice were male, as specified in our methods section and our revised main text:

      “To experimentally test this hypothesis, we transplanted the gut microbiotas of two representative lean W and lean EA individuals into germ-free male C57BL/6J mice…Next, we sought to assess the reproducibility of these findings across multiple donors and in the context of a distinctive dietary pressure. We fed 20 germ-free male mice a high-fat, high-sugar (HFHS) diet for 4 weeks prior to colonization with a gut microbiota from one of 5 W and 5 EA donors....”

      To further investigate any potential sex-specific signal, we have stratified our analysis of the HFHS experiment by the sex of the donors (Reviewer Figure 2). This reveals that the significant association with ethnicity in the microbiota transplantation experiments is preserved in mice that received stool from male donors (Reviewer Fig. 2A) but not from female donors (Reviewer Fig. 2B). In Reviewer Fig. 1 above, LFPP1 and LFPP2 were conducted using different donors of different biological sex. Splitting up our LFPP experiments revealed the same consistent signal for ethnicity in microbial community composition that we report above. The small sample sizes in this stratified analysis make it difficult to conclude that there are reproducible sex-specific differences in the microbiome transplant experiments, but we agree with the reviewer that this question should be explored more thoroughly in future work.

      We have added a brief note to the discussion to emphasize this important point:

      “...differences between the human donor and recipient mouse microbiotas inherent to gnotobiotic transplantation warrant further investigation as do differences in the stability of the gut microbiotas of male versus female donors”

      Reviewer Figure 2. (A,B) Principal coordinate analysis of PhILR Euclidean distances of stool from germ-free recipient mice transplanted with stool microbial communities from (A) male (n=2 EA and n=2 W donors) or (B) female (n=3 EA and n=3 W) donors of either ethnicity and fed a HFHS diet. Significance was assessed by ADONIS. Pairs of germ-free mice receiving the same donor sample are connected by a dashed line (n=2 recipient mice per donor). Experimental designs are shown in Fig. S7.

      Finally, experimental results are not always consistent and sometimes show opposite trends that may be related to the sample sizes. For instance, fat mass increased and lean mass decreased in LFPP, but there were no comparable statistically significant differences in HFHS. Moreover, the metabolic fat mass outcomes in mice do not match the expected human donor data. For instance, in LFPP1, White subjects had lower fat mass in humans, but their recipient mice on average gained more fat. It is difficult to attribute these differences to a biological or sampling-scheme explanation.

      We wholeheartedly agree with this point and were also surprised that the recipient mouse phenotypes did not match our original hypothesis based upon the observed health disparities between EA and W individuals. These surprising and perhaps counter-intuitive results demand further study and mechanistic dissection. We have tried to capture potential explanations for these findings while highlighting the limitations of our current study in our expanded discussion. With respect to the glucose tolerance data, the lack of a microbiome-driven phenotype might be due to the use of genetically identical mice that are not prone to metabolic illness without significant perturbation. If we had used mice prone to metabolic disease, such as non-obese diabetic (NOD) germ-free recipient mice, where the microbiome is known to impact the development of diabetes, we might have observed differences in glucose tolerance between ethnicities.

      Our revised discussion, with key points underlined is copied below for your convenience:

      “Our results in humans and mouse models support the broad potential for downstream consequences of ethnicity-associated differences in the gut microbiome for metabolic syndrome and potentially other disease areas. However, the causal relationships and how they can be understood in the context of the broader differences in host phenotype between ethnicities require further study. While these data are consistent with our general hypothesis that ethnicity-associated differences in the gut microbiome are a source of differences in host metabolic disease risk, we were surprised by both the nature of the microbiome shifts and their directionality. Based upon observations in the IDEO (Alba et al., 2018) and other cohorts (Gu et al., 2006; Zheng et al., 2011), we anticipated that the gut microbiomes of lean EA individuals would promote obesity or other features of metabolic syndrome. In humans, we did find multiple signals that have been previously linked to obesity and its associated metabolic diseases in EA individuals, including increased Firmicutes (Basolo et al., 2020; Bisanz et al., 2019), decreased A. muciniphila (Depommier et al., 2019; Plovier et al., 2017), decreased diversity (Turnbaugh et al., 2009a), and increased acetate (Perry et al., 2016; Turnbaugh et al., 2006). Yet EA subjects also had higher levels of Bacteroidota and Bacteroides, which have been linked to improved metabolic health (Johnson et al., 2017). More importantly, our microbiome transplantations demonstrated that the recipients of the lean EA gut microbiome had less body fat despite consuming the same diet. These seemingly contradictory findings may suggest that the recipient mice lost some of the microbial features of ethnicity relevant to host metabolic disease or alternatively that the microbiome acts in a beneficial manner to counteract other ethnicity-associated factors driving disease.

      EA subjects also had elevated levels of the short-chain fatty acids propionate and isobutyrate. The consequences of elevated intestinal propionate levels are unclear given the seemingly conflicting evidence in the literature that propionate may either exacerbate (Tirosh et al., 2019) or protect from (Lu et al., 2016) aspects of metabolic syndrome. Clinical data suggests that circulating propionate may be more relevant for disease than fecal levels (Müller et al., 2019), emphasizing the importance of considering the specific microbial metabolites produced, their intestinal absorption, and their distribution throughout the body. Isobutyrate is even less well-characterized, with prior links to dietary intake (Berding and Donovan, 2018) but no association with obesity (Kim et al., 2019). Unlike SCFAs, we did not identify consistent differences in BCAAs, potentially due to differences in both extraction and standardization techniques inherent to GC-MS and NMR analysis (Cai et al., 2016; Lynch and Adams, 2014; Qin et al., 2012).

      There are multiple limitations of this study. Due to the investment of resources into ensuring a high level of phenotypic information on each cohort member coupled to the restricted geographical catchment area, the IDEO cohort was relatively small at the time of this analysis (n=46 individuals). The current study only focused on two of the major ethnicities in the San Francisco Bay Area. As IDEO continues to expand and diversify its membership, we hope to study a sufficient number of participants from other ethnic groups. Stool samples were collected at a single time point and analyzed in a cross-sectional manner. While we used validated tools from the field of nutrition to monitor dietary intake, we cannot fully exclude subtle dietary differences between ethnicities (Johnson et al., 2019), which could be interrogated through controlled feeding studies (Basolo et al., 2020). Our mouse experiments were all performed in wild-type adult males. The use of a microbiome-dependent transgenic mouse model of diabetes (Brown et al., 2016) would be useful to test the effects of inter-ethnic differences in the microbiome on insulin and glucose tolerance. Additional experiments are warranted using the same donor inocula to colonize germ-free mice prior to concomitant feeding of multiple diets, allowing a more explicit test of the hypothesis that diet can disrupt ethnicity-associated microbial signatures. These studies, coupled to controlled experimentation with individual strains or more complex synthetic communities, would help to elucidate the mechanisms responsible for ethnicity-associated changes in host physiology and their relevance to disease.”

      Reviewer #3 (Public Review):

      The authors aimed to characterise how the gut microbiota differs between ethnic groups in bacterial richness and community structure. They also wanted to address how this is associated with ethnic group within a defined geographical location. They started their story by comparing the fecal microbiota of a relatively small cohort of 46 lean and obese East Asian and White participants living in the San Francisco Bay Area, using 16S and shotgun metagenomic sequencing. They demonstrated that ethnicity-associated differences in the gut microbiota are stronger in lean individuals, whereas obese individuals did not show a clear difference in gut microbiota profile between ethnic groups, suggesting either that established obesity or its associated dietary patterns can overwrite long-lasting microbial signatures, or alternatively that there is a shared ethnicity-independent microbiome type that predisposes individuals to obesity. The authors also showed metabolic differences between these ethnic groups, with the major differences in branched-chain amino acids and short-chain fatty acids. To support this point, they also used different metabolomic methodologies. Although some aspects of the work are not very novel, the work does provide additional insights into the effect(s) of ethnicity, current living location and diet on shaping the microbiota. Honestly, while reading through the manuscript, I had several questions where I believed clarification was needed. But somehow, I felt as if the authors had been reading my mind every step of the way: whatever I questioned at the end of each section was addressed in the next paragraph. There are, however, a few points on which I would like to hear the authors' clarification.

      • The authors pursued the story using 16S data. However, they also have shotgun metagenomics data, which give more power and resolution to the microbiota profile. Is there any specific reason why the story was not built with the shotgun metagenomic data? In either case, it would be helpful to justify this and to state in the text or legend exactly which dataset each figure was built with.

      As discussed above, 16S rRNA gene and metagenomic sequencing both have strengths and weaknesses. For example, 16S-seq is inexpensive and allows analysis of low-abundance species, whereas metagenomics permits analysis of gene and pathway abundances of abundant taxa. As requested, we have now expanded Figure 2 (metagenomics) to better match Figure 1 (16S-seq). The type of technology is defined within each legend and the relevant text within our results.

      • Even though the authors mention in the discussion that they did not use the same inocula from a donor across different diets, it would be nice if the authors could further comment on whether they would expect the same or slightly different results with each different inoculum.

      As requested, we have modified the text in our discussion to include these comments:

      “Additional experiments are warranted using the same donor inocula to colonize germ-free mice prior to concomitant feeding of multiple diets, allowing a more explicit test of the hypothesis that diet can disrupt ethnicity-associated microbial signatures. These studies, coupled to controlled experimentation with individual strains or more complex synthetic communities, would help to elucidate the mechanisms responsible for ethnicity-associated changes in host physiology and their relevance to disease.”

      Overall, the study is well executed and claims and conclusions seem relatively well justified by the provided evidence. The findings are interesting for a broad audience of biologists.

    1. Author Response:

      Reviewer #1 (Public Review):

      Overall, the authors have done a nice job covering the relevant literature, presenting a story out of complicated data, and performing many thoughtful analyses.

      However, I believe the paper requires quite major revisions.

      We thank the reviewer for their encouraging assessment of our manuscript. We are grateful for their valuable and especially detailed feedback that helped us to substantially improve our manuscript.

      Major issues:

      I do not believe the current results present a clear, comprehensible story about sleep and motor memory consolidation. As presented, sleep predicts an increase in the subsequent learning curve, but there is a negative relationship between learning curve and task proficiency change (which is, as far as I can tell, similar to "memory retention"). This makes it seem as if sleep predicts more forgetting on initial trials within the subsequent block (or worse memory retention) - is this true? Regardless of whether it is statistically true, there appears to be another story in these data that is being sacrificed to fit a story about sleep. To my eye, the results may first and foremost tell a circadian (rather than sleep) story. Examining the data in Figure 2A and 2B, it appears that every AM learning period has a higher learning curve (slope) than every PM period. While this could, of course, be due to having just slept, the main story gleaned from such a result is not a sleep effect on retention, which has been the emphasis of motor memory consolidation research in the last couple of decades, but an effect on new learning. The fact that this effect appears present in the first session (juggling blocks 1-3 in adolescents and blocks 1-5 in adults) makes this seem the more likely story here, since it has less to do with "preparing one to re-learn" and more to do with just learning and when that learning is optimal. But even if it does not reach statistical significance in the first session alone, it remains a concern and, in my opinion, should be considered a focus in the manuscript unless the authors can devise a reason to definitively rule it out.

      Here is how I recommend the authors proceed on this point: include all sessions from all subjects into a mixed effect model, predicting the slope of the learning curve with time of day and age group as fixed effects and subjects as random effects:

      learning curve slope ~ AM/PM [AM (0) or PM (1)] + age [adolescent (0) or adult (1)] + (1|subject)

      …or something similar with other regressors of interest. If this is significant for AM/PM status, they should re-try the analysis using only the first session. If this is significant, then a sleep-centric story cannot be defended here at all, in my opinion. If it is not (which could simply result from low power, but the authors could decide this), the authors should decide if they think they can rule out circadian effects and proceed accordingly. I should note that, while to many, a sleep story would be more interesting or compelling, that is not my opinion, and I would not solely opt to reject this paper if it centered a time-of-day story instead.

      The authors need to work out precisely what is happening in the behavior here, and let the physiology follow that story. They should allow themselves to consider very major revisions (and drop the physiology) if that is most consistent with the data. As presented, I am very unclear of what to take away from the study.

      We thank the reviewer for the opportunity to further elaborate on our behavioral results. We agree that the interpretation of the behavior in the complex gross-motor task is not straightforward, which might be partly due to less controllability compared to, for example, finger-tapping tasks. The reviewer is correct that, initially, sleep seems to predict more forgetting on initial trials within the subsequent block, given the dip in task proficiency and the resulting increase in steepness of the learning curve after the sleep retention interval. Notably, this dip in performance after sleep has also been reported for finger-tapping tasks (cf. Eichenlaub et al, 2020). The performance dip is also present in the wake-first group (Figure 2) after the first interval. This observation suggests that picking up the task again after a period of time comes at a cost. Interestingly, this performance dip is no longer present after the second retention interval, indicating that the better the task proficiency, the easier it is to pick up juggling again. In other words, juggling has been better consolidated after additional training. Critically, our results show that participants with higher SO-spindle coupling strength have a smaller dip in performance after the retention interval, thus indicating a learning advantage.

      Figure 2

      (A) Number of successful three-ball cascades (mean ± standard error of the mean [SEM]) of adolescents (circles) for the sleep-first (blue) and wake-first group (green) per juggling block. Grand average learning curve (black lines) as computed in (C) are superimposed. Dashed lines indicate the timing of the respective retention intervals that separate the three performance tests. Note that adolescents improve their juggling performance across the blocks. (B) Same conventions as in (A) but for adults (diamonds). Similar to adolescents, adults improve their juggling performance across the blocks regardless of group.

      We discuss the sleep effect on juggling in the discussion section (page 22 – 23, lines 502 – 514):

      "How relevant is sleep for real-life gross-motor memory consolidation? We found that sleep impacts the learning curve but did not affect task proficiency in comparison to a wake retention interval (Figure 2DE). Two accounts might explain the absence of a sleep effect on task proficiency. (1) Sleep stabilizes rather than improves gross-motor memory, which is in line with previous gross-motor adaption studies (Bothe et al, 2019; Bothe et al, 2020). (2) Pre-sleep performance is critical for sleep to improve motor skills (Wilhelm et al, 2012). Participants commonly reach asymptotic pre-sleep performance levels in finger-tapping tasks, which are most frequently used to probe sleep effects on motor memory. Here we found that, in a complex juggling task, participants do not reach asymptotic ceiling performance levels in such a short time. Indeed, the learning progression for the sleep-first and wake-first groups followed a similar trend (Figure 2AB), suggesting that more training and not in particular sleep drove performance gains."

      If indeed the authors keep the sleep aspect of this story, here are some comments regarding the physiology. The authors present several nice analyses in Figure 3. However, given the lack of behavioral difference between adolescents and adults (Fig 2D), they combine the groups when investigating behavior-physiology relationships. In some ways, then, Figure 3 has extraneous details to the point of motor learning and retention, and I believe the paper would benefit from more focus. If the authors keep their sleep story, I believe Figure 3 and 4 should be combined and some current figure panels in Figure 3 should be removed or moved to the supplementary information.

      We thank the reviewers for their suggestion and we agree that the figures of our manuscript would benefit from more focus. Therefore, we combined Figure 3 and 4 from the original manuscript into a revised Figure 3 in the updated version of the manuscript. In more detail, subpanels that explain our methodological approach can now be found in Figure 3 – figure supplement 1, while the updated Figure 3 now focuses on developmental changes in oscillatory dynamics and SO-spindle coupling strength as well as their relationship to gross-motor learning.

      Updated Figure 3:

      (A) Left: topographical distribution of the 1/f corrected SO and spindle amplitude as extracted from the oscillatory residual (Figure 3 – figure supplement 1A, right). Note that adolescents and adults both display the expected topographical distribution of more pronounced frontal SO and centro-parietal spindles. Right: single subject data of the oscillatory residual for all subjects with sleep data color coded by age (darker colors indicate older subjects). SO and spindle frequency ranges are indicated by the dashed boxes. Importantly, subjects displayed high inter-individual variability in the sleep spindle range and a gradual spindle frequency increase by age that is critically underestimated by the group average of the oscillatory residuals (Figure 3 – figure supplement 1A, right). (B) Spindle peak locked epoch (NREM3, co-occurrence corrected) grand averages (mean ± SEM) for adolescents (red) and adults (black). Inset depicts the corresponding SO-filtered (2 Hz lowpass) signal. Grey-shaded areas indicate significant clusters. Note, we found no difference in amplitude after normalization. Significant differences are due to more precise SO-spindle coupling in adults. (C) Top: comparison of SO-spindle coupling strength between adolescents and adults. Adults displayed more precise coupling than adolescents in a centro-parietal cluster. T-scores are transformed to z-scores. Asterisks denote cluster-corrected two-sided p < 0.05. Bottom: Exemplary depiction of coupling strength (mean ± SEM) for adolescents (red) and adults (black) with single subject data points. Exemplary single electrode data (bottom) is shown for C4 instead of Cz to visualize the difference. (D) Cluster-corrected correlations between individual coupling strength and overnight task proficiency change (post – pre retention) for adolescents (red, circle) and adults (black, diamond) of the sleep-first group (left, data at C4). Asterisks indicate cluster-corrected two-sided p < 0.05. Grey-shaded area indicates 95% confidence intervals of the trend line. Participants with a more precise SO-spindle coordination show improved task proficiency after sleep. Note that the change in task proficiency was inversely related to the change in learning curve (cf. Figure 2D), indicating that a stronger improvement in task proficiency related to a flattening of the learning curve. Further note that the significant cluster formed over electrodes close to motor areas. (E) Cluster-corrected correlations between individual coupling strength and overnight learning curve change. Same conventions as in (D). Participants with more precise SO-spindle coupling over C4 showed attenuated learning curves after sleep.
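      The captions here do not spell out how SO-spindle coupling strength is computed. A common definition is the mean vector length of the SO phases at which coupled spindles reach peak amplitude; the sketch below assumes that definition and takes the phases as given (for example, from the Hilbert transform of the 2 Hz low-pass SO signal), a step it does not show. The function names are hypothetical.

```python
import cmath


def coupling_strength(so_phases):
    """Mean vector length of SO phases (radians) sampled at spindle peaks.

    0 means spindle peaks are spread evenly over the SO cycle; 1 means every
    spindle peaks at exactly the same SO phase.
    """
    if not so_phases:
        raise ValueError("need at least one coupled SO-spindle event")
    resultant = sum(cmath.exp(1j * phi) for phi in so_phases) / len(so_phases)
    return abs(resultant)


def preferred_phase(so_phases):
    """Circular mean SO phase (radians, in [-pi, pi]) at which spindles peak."""
    return cmath.phase(sum(cmath.exp(1j * phi) for phi in so_phases))
```

      Because the measure is a length on the unit circle, "more precise coupling" in adults corresponds to spindle peaks clustering more tightly around one SO phase, independent of what that preferred phase is.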

      and

      Figure 3 - figure supplement 1

      (A) Left: Z-normalized EEG power spectra (mean ± SEM) for adolescents (red) and adults (black) during NREM sleep in semi-log space. Data is displayed for the representative electrode Cz unless specified otherwise. Note the overall power difference between adolescents and adults due to a broadband shift on the y-axis. Straight black line denotes cluster-corrected significant differences. Middle: 1/f fractal component that underlies the broadband shift. Right: Oscillatory residual after subtracting the fractal component (A, middle) from the power spectrum (A, left). Both groups show clearly delineated peaks in the SO (< 2 Hz) and spindle range (11 – 16 Hz), establishing the presence of the cardinal sleep oscillations in the signal. (B) Top: Spindle frequency peak development based on the oscillatory residuals. Spindle frequency is faster at all but occipital electrodes in adults than in adolescents. T-scores are transformed to z-scores. Asterisks denote cluster-corrected two-sided p < 0.05. Bottom: Exemplary depiction of the spindle frequency (mean ± SEM) for adolescents (red) and adults (black) with single subject data points at Cz. (C) SO-spindle co-occurrence rate (mean ± SEM) for adolescents (red) and adults (black) during NREM2 and NREM3 sleep. Event co-occurrence is higher in NREM3 (F(1, 51) = 1209.09, p < 0.001, partial eta² = 0.96) as well as in adults (F(1, 51) = 11.35, p = 0.001, partial eta² = 0.18). (D) Histogram of co-occurring SO-spindle events in NREM2 (blue) and NREM3 (purple) collapsed across all subjects and electrodes. Note the low co-occurring event count in NREM2 sleep. (E) Single subject (top) and group averages (bottom, mean ± SEM) for adolescents (red) and adults (black) of individually detected, SO co-occurrence-corrected sleep spindles in NREM3. Spindles were detected based on the information of the oscillatory residual. Note the underlying SO-component (grey) in the spindle detection for single subject data and group averages indicating a spindle amplitude modulation depending on SO-phase. (F) Grand average time frequency plots (-2 to -1.5s baseline-corrected) of SO-trough-locked segments (corrected for spindle co-occurrence) in NREM3 for adolescents (left) and adults (right). Schematic SO is plotted superimposed in grey. Note the alternating power pattern in the spindle frequency range, showing that SO-phase modulates spindle activity in both age groups.

      Why did the authors use Spearman rather than Pearson correlations in Figure 4? Was it to reduce the influence of the outlier subject? They should minimally clarify and justify this, since it is less conventional in this line of research. And it would be useful to know if the relationship is significant with Pearson correlations when robust regression is applied. I see the authors are using MATLAB, and the robustfit function (https://www.mathworks.com/help/stats/robustfit.html) is a simple way to address this issue.

      We thank the reviewers for their suggestion. We agree that, when inspecting the scatter plots, it appears that the correlations could be severely influenced by two outliers in the adult group. Because this is an important matter, we recalculated all previously reported correlations without the two outliers (Figure R4, left column) and followed the reviewer’s suggestion to also compute robust regression (Figure R4, right column) and found no substantial deviation from our original results.

      In more detail, an increase in task proficiency resulted in a flattening of the learning curve when removing outliers (Figure R4A, rhos = -0.70, p < 0.001) and when applying robust regression analysis (Figure R4B, b = -0.30, t(67) = -10.89, rho = -0.80, p < 0.001). Likewise, higher coupling strength still predicted better task proficiency (mean rho = 0.35, p = 0.029, cluster-corrected) and flatter learning curves after sleep (rho = -0.44, p = 0.047, cluster-corrected) when removing the outliers (Figure R4CE) and when calculating robust regression (Figure R4DF, task proficiency: b = 82.32, t(40) = 3.12, rho = 0.45, p = 0.003; learning curve: b = -26.84, t(40) = -2.96, rho = -0.43, p = 0.005). Furthermore, in our original manuscript we calculated Spearman rank correlations and cluster-corrected Spearman rank correlations to mitigate the impact of outliers, even though Pearson correlations are more widely used in the field. Therefore, we still report Spearman rank correlations for single electrodes instead of robust correlations, as this is more consistent with the cluster-correlation analyses.
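      For readers unfamiliar with robust regression: MATLAB's robustfit defaults to iteratively reweighted least squares with Tukey bisquare weights (tuning constant 4.685). The sketch below reimplements that core loop for a single predictor in Python, assuming a simple median-absolute-residual scale estimate; it omits robustfit's leverage adjustment of residuals, so its estimates will not match robustfit exactly. The function names are hypothetical.

```python
import statistics


def weighted_line(x, y, w):
    """Weighted least-squares intercept and slope for y = a + b * x."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def bisquare_fit(x, y, c=4.685, max_iter=50, tol=1e-10):
    """Robust line via iteratively reweighted least squares with bisquare weights."""
    a, b = weighted_line(x, y, [1.0] * len(x))  # ordinary least-squares start
    for _ in range(max_iter):
        residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        # Robust scale: median absolute residual, rescaled to a normal sigma.
        scale = statistics.median(abs(r) for r in residuals) / 0.6745
        if scale == 0:
            break  # at least half of the points already lie exactly on the line
        w = []
        for r in residuals:
            u = r / (c * scale)
            w.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)  # outliers get weight 0
        new_a, new_b = weighted_line(x, y, w)
        converged = abs(new_a - a) < tol and abs(new_b - b) < tol
        a, b = new_a, new_b
        if converged:
            break
    return a, b
```

      With one gross outlier among ten otherwise collinear points, ordinary least squares is pulled far from the true slope, while the bisquare weights drive the outlier's weight to zero within a few iterations.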

      We now use robust trend lines instead of linear trend lines in our scatter plots. Further, we added the correlations without outliers (Figure R4ACE) to the supplements as Figure 2 – figure supplement 1D and Figure 3 – figure supplement 2 FG. These additional analyses are now reported in the results section of the revised manuscript (page 9, lines 186 – 191):

      "[…] we confirmed a strong negative correlation between the change (post retention values – pre retention values) in task proficiency and the change in learning curve after the retention interval (Figure 2F; rhos = -0.71, p < 0.001), which also remained strong after outlier removal (Figure 2 – figure supplement 1D). This result indicates that participants who consolidate their juggling performance after a retention interval show slower gains in performance."

      And (page 16, lines 343 – 346):

      "[…] Furthermore, our results remained consistent when including coupled spindle events in NREM2 (Figure 3 – figure supplement 2E) and after outlier removal (Figure 3 – figure supplement 2FG)."

      Furthermore, we now state in the methods section that we specifically utilized Spearman rank correlations to mitigate the impact of outliers in our analyses (page 35, lines 808 – 813):

      "For correlational analyses we utilized Spearman rank correlations (rhos; Figure 2F & Figure 3DE) to mitigate the impact of possible outliers as well as cluster-corrected Spearman rank correlations by transforming the correlation coefficients to t-values (p < 0.05) and clustering in the space domain (Figure 3DE). Linear trend lines were calculated using robust regression."
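      The two correlation steps described in this methods excerpt can be sketched as follows, assuming average ranks for ties and the standard transformation t = rho * sqrt((n - 2) / (1 - rho^2)) with n - 2 degrees of freedom. The spatial clustering and permutation steps that follow are not shown, and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def rho_to_t(rho, n):
    """t statistic (n - 2 df) used to threshold electrodes before clustering.

    Undefined for |rho| == 1.
    """
    return rho * math.sqrt((n - 2) / (1 - rho ** 2))
```

      Because Spearman's rho depends only on ranks, a single extreme value can shift it by at most one rank position, which is the outlier-mitigation rationale given in the excerpt.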

      Figure R4

      (A) Spearman rank correlation between task proficiency change and learning curve change collapsed across adolescents (red dots) and adults (black diamonds) after removing two outlier subjects in the adult age group. Grey-shaded area indicates 95% confidence intervals of the robust trend line. (B) Robust regression of task proficiency change and learning curve change of the original sample. (C) Cluster-corrected correlations (right) between individual coupling strength and overnight task proficiency change (post – pre retention) after outlier removal (left, Spearman correlation at C4, uncorrected). Asterisks indicate cluster-corrected two-sided p < 0.05. (D) Robust regression of coupling strength at C4 and task proficiency of the original sample. (E) Same conventions as in (C) but for overnight learning curve change. (F) Same conventions as in (D) but for overnight learning curve change.
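
      The Methods passage quoted above describes cluster-corrected Spearman correlations obtained by transforming correlation coefficients to t-values and clustering them across electrodes. A minimal sketch of that idea follows, assuming the standard conversion t = rho * sqrt((n - 2) / (1 - rho^2)), a two-sided cluster-forming threshold from the t distribution with n - 2 degrees of freedom, summed t-values as the cluster statistic, and a permutation null built by shuffling behavior across subjects. The neighbor map, function names and defaults are hypothetical and do not reproduce the authors' pipeline; toolboxes such as FieldTrip implement this procedure with many more options.

```python
import numpy as np
from scipy.stats import spearmanr, t as t_dist


def rho_to_t(rho, n):
    """Convert correlation coefficients to t-values with n - 2 degrees of freedom."""
    rho = np.clip(rho, -0.999999, 0.999999)  # avoid division by zero at |rho| = 1
    return rho * np.sqrt((n - 2) / (1 - rho**2))


def electrode_t(coupling, behavior):
    """Spearman rho per electrode, returned as t-values.
    coupling: (n_subjects, n_electrodes); behavior: (n_subjects,)."""
    rhos = np.array([spearmanr(coupling[:, e], behavior)[0]
                     for e in range(coupling.shape[1])])
    return rho_to_t(rhos, len(behavior))


def find_clusters(tvals, thresh, neighbors):
    """Spatially connected, same-sign supra-threshold electrodes.
    Returns a list of (electrode_indices, summed_t)."""
    clusters, seen = [], set()
    for start in range(len(tvals)):
        if start in seen or abs(tvals[start]) <= thresh:
            continue
        sign, stack, members = np.sign(tvals[start]), [start], []
        seen.add(start)
        while stack:  # depth-first flood fill over the neighbor graph
            e = stack.pop()
            members.append(e)
            for nb in neighbors[e]:
                if (nb not in seen and abs(tvals[nb]) > thresh
                        and np.sign(tvals[nb]) == sign):
                    seen.add(nb)
                    stack.append(nb)
        members.sort()
        clusters.append((members, float(tvals[members].sum())))
    return clusters


def cluster_corrected_spearman(coupling, behavior, neighbors,
                               n_perm=1000, alpha=0.05, seed=0):
    """Returns (electrodes, cluster_mass, p_value) for each observed cluster."""
    rng = np.random.default_rng(seed)
    n = len(behavior)
    thresh = t_dist.ppf(1 - alpha / 2, df=n - 2)  # two-sided cluster-forming threshold
    observed = find_clusters(electrode_t(coupling, behavior), thresh, neighbors)
    null_max = np.empty(n_perm)
    for i in range(n_perm):
        shuffled = rng.permutation(behavior)  # break the subject-wise pairing
        masses = [abs(m) for _, m in
                  find_clusters(electrode_t(coupling, shuffled), thresh, neighbors)]
        null_max[i] = max(masses, default=0.0)
    return [(members, mass, (np.sum(null_max >= abs(mass)) + 1) / (n_perm + 1))
            for members, mass in observed]


# Toy example: 5 electrodes on a chain; electrodes 0-2 track behavior
rng = np.random.default_rng(1)
behavior = rng.normal(size=30)
coupling = rng.normal(size=(30, 5))
coupling[:, :3] += 2.0 * behavior[:, None]
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
result = cluster_corrected_spearman(coupling, behavior, chain, n_perm=200)
```

      Comparing each observed cluster against the distribution of the largest null cluster controls the family-wise error across electrodes without a per-electrode correction.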

      Additionally, with only a single night of recording data, it is impossible to disentangle possible trait-based sleep characteristics (e.g., Subject 1 has high SO-spindle coupling in general and retains motor memories well, but these are independent of each other) from a specific, state-based account (e.g., Subject 1's high SO-spindle coupling on night 1 specifically led to their improved retention or change in learning, etc., and this is unrelated to their general SO-spindle coupling or motor performance abilities). Clearly, many studies face this limitation, but this should be acknowledged.

      We thank the reviewers for this important remark. We agree that, with the data reported in the manuscript, it is impossible to make a sound statement about whether our reported correlations reflect trait- or state-based aspects of the relationship between sleep and learning. However, although we lack a proper baseline condition without any task engagement, we did record polysomnography for all subjects during an adaptation night. Given the expected pronounced differences in sleep architecture between the adaptation night and the learning night (see Table R3 for an overview collapsed across both age groups), we initially refrained from including data from the adaptation nights in our original analyses, but we now report these data in full below. Note that these differences are driven by the adaptation night, in which subjects first have to adjust to sleeping in a sleep laboratory with EEG electrodes attached.

      Table R3. Sleep architecture (mean ± standard deviation) for the adaptation and learning nights collapsed across both age groups. Nights were compared using paired t-tests.

      To further clarify whether subjects with high coupling strength have a motor learning advantage (i.e. a trait effect) or whether a learning-induced enhancement of coupling strength is indicative of improved overnight memory change (i.e. a state effect), we ran additional analyses using the data from the adaptation night. Note that the coupling strength metric was not affected by differences in event number, and our correlations with behavior were not influenced by sleep architecture (please refer to our answer to issue #7 for the results). Therefore, we considered it appropriate to also use data from the adaptation night.

      First, we correlated SO-spindle coupling strength obtained from the adaptation night with the coupling strength in the learning night. We found that overall, coupling strength is highly correlated between the two measurements (mean rho across all channels = 0.55, Figure R5A), supporting the notion that coupling strength remains rather stable within the individual (i.e. trait), similar to what has been reported about the stable nature of sleep spindles as a “neural finger-print” (De Gennaro & Ferrara, 2003; De Gennaro et al, 2005; Purcell et al, 2017).

      To investigate a possible state effect of coupling strength on motor learning, we calculated the difference in coupling strength between the two nights (learning night – adaptation night) and correlated these values with the overnight change in task proficiency and learning curve. We found no significant correlations between learning-induced coupling strength change and either task proficiency change or learning curve change (Figure R5B). Note that there was a positive correlation of coupling strength change with overnight task proficiency change at Cz (Figure R5B, left); however, it did not survive cluster-corrected correlational analysis (rhos = 0.34, p = 0.15). Combined, these results favor the conclusion that our correlations between coupling strength and learning reflect a trait-like rather than a state-like relationship. This is in line with the interpretation of our previous studies that SO-spindle coupling strength reflects the efficiency and integrity of the neuronal pathway between neocortex and hippocampus, which is paramount for memory networks and information transfer during sleep (Hahn et al, 2020; Helfrich et al, 2019; Helfrich et al, 2018; Winer et al, 2019). For a comprehensive review, please see Helfrich et al (2021), who argued that SO-spindle coupling predicts the integrity of memory pathways and therefore correlates with various metrics of behavioral performance or structural integrity.
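
      The two-step trait/state logic (night-to-night stability of coupling strength, then correlating the learning-minus-adaptation difference with behavior) can be sketched per channel as follows. The array layout, the function name `trait_state_summary` and the simulated data are hypothetical, and the cluster correction that the response applies to the state correlations is omitted for brevity.

```python
import numpy as np
from scipy.stats import spearmanr


def trait_state_summary(adaptation, learning, behavior):
    """adaptation, learning: (n_subjects, n_channels) coupling strength per night;
    behavior: (n_subjects,) overnight change score.
    Returns (mean night-to-night rho, per-channel rho of the change score)."""
    n_channels = adaptation.shape[1]
    # Trait: does coupling strength rank subjects similarly across nights?
    stability = np.array([spearmanr(adaptation[:, c], learning[:, c])[0]
                          for c in range(n_channels)])
    # State: does the learning-induced change (learning - adaptation) track behavior?
    change = learning - adaptation
    state = np.array([spearmanr(change[:, c], behavior)[0]
                      for c in range(n_channels)])
    return stability.mean(), state


# Simulated subjects whose coupling is a stable trait plus night-specific noise
rng = np.random.default_rng(2)
trait = rng.normal(size=(40, 1))
adaptation = trait + 0.3 * rng.normal(size=(40, 4))
learning = trait + 0.3 * rng.normal(size=(40, 4))
behavior = trait[:, 0] + 0.5 * rng.normal(size=40)
mean_stability, state_rhos = trait_state_summary(adaptation, learning, behavior)
```

      In this simulation behavior depends only on the stable trait, so the night-to-night correlation is high while the change score has no built-in relation to behavior, which is the pattern the response interprets as trait-like.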

      Figure R5

      (A) Topographical plot of Spearman rank correlations of coupling strength between the adaptation night and the learning night across all subjects. Overall, coupling strength was highly correlated between the two measurements. (B) Cluster-corrected correlation between learning-induced coupling strength changes (learning night – adaptation night) and overnight change in task proficiency (left) as well as learning curve (right). We found no significant clusters, although the correlations showed trends similar to our original analyses, with larger learning-induced changes in coupling strength associated with better overnight task proficiency and flatter learning curves.

      We have now added the additional state-trait analyses (Figure R5) to the updated manuscript as Figure 3 – figure supplement 2HI and report them in the results section (page 17, lines 361 – 375):

      "Finally, we investigated whether subjects with high coupling strength have a gross-motor learning advantage (i.e. trait-effect) or a learning induced enhancement of coupling strength is indicative of improved overnight memory change (i.e. state-effect). First, we correlated SO-spindle coupling strength obtained from the adaptation night with the coupling strength in the learning night. We found that overall, coupling strength is highly correlated between the two measurements (mean rho across all channels = 0.55, Figure 3 – figure supplement 2H), supporting the notion that coupling strength remains rather stable within the individual (i.e. trait). Second, we calculated the difference in coupling strength between the learning night and the adaptation night to investigate a possible state-effect. We found no significant cluster-corrected correlations between coupling strength change and task proficiency- as well as learning curve change (Figure 3 – figure supplement 2I).

      Collectively, these results indicate that regionally specific SO-spindle coupling over central EEG sensors encompassing sensorimotor areas precisely indexes learning of a challenging motor task."

      We further refer to these new results in the discussion section (page 23, lines 521 – 528):

      "Moreover, we found that SO-spindle coupling strength remains remarkably stable between two nights, which also explains why a learning-induced change in coupling strength did not relate to behavior (Figure 3 – figure supplement 2I). Thus, our results primarily suggest that strength of SO-spindle coupling correlates with the ability to learn (trait), but does not solely convey the recently learned information. This set of findings is in line with recent ideas that strong coupling indexes individuals with highly efficient subcortical-cortical network communication (Helfrich et al, 2021)."

      Additionally, we now provide descriptive data for the adaptation and learning nights (Table R3) in Supplementary file – table 1 and explicitly mention the adaptation night in the results section; previously, it was only mentioned in the Methods section (page 6, lines 101 – 105):

      "Polysomnography (PSG) was recorded during an adaptation night and during the respective sleep retention interval (i.e. learning night) except for the adult wake-first group (for sleep architecture descriptive parameters of the adaptation night and learning night as well as for adolescents and adults see Supplementary file – table 1 & 2)."

      Reviewer #2 (Public Review):

      In this study, Hahn and colleagues investigate the role of slow oscillation-spindle (SO-spindle) coupling in motor memory consolidation and the impact of brain maturation on these interactions. The authors employed a real-life gross-motor task in which adolescents and adults learned to juggle. They demonstrate that during post-learning sleep, SOs and spindles are more strongly coupled in adults than in adolescents. The authors further show that the strength of SO-spindle coupling correlates with overnight changes in the learning curve and task proficiency, indicating a role of SO-spindle coupling in motor memory consolidation.

      Overall, the topic and the results of the present study are interesting and timely. The authors employed state-of-the-art analyses, carefully taking the general variability of oscillatory features into account. It also has to be acknowledged that the authors moved away from using rather artificial lab tasks to study the consolidation of motor memories (as is standard in the field), adding ecological validity to their findings. However, some features of their analyses need further clarification.

      We thank the reviewer for their positive assessment of our manuscript. Incorporating the encouraging and helpful feedback, we believe that we substantially improved the clarity and robustness of our analyses.

      1) Supporting and extending previous work of the authors (Hahn et al, 2020), SO-spindle coupling over centro-parietal areas was stronger in adults as compared to adolescents. Despite these differences in the EEG results the authors collapsed the data of adults and adolescents for their correlational analyses (Fig. 4a and 4b). Why would the authors think that this procedure is viable (also given the fact that different EEG systems were used to record the data)?

      We thank the reviewers for the opportunity to clarify why we think it is viable to collapse the data of adolescents and adults for our correlational analyses. In the following we split our answers based on the two points raised by the reviewers: (1) electrophysiological differences (i.e. coupling strength) between the groups and (2) potential signal differences due to different EEG systems.

      1. Electrophysiological differences

      First, upon inspecting the original Figure 4, it is apparent that the coupling strength values of the combined sample do not form isolated clusters for each age group. In other words, while adult coupling strength lies at the higher end and adolescent coupling strength at the lower end, owing to the developmental increase in coupling strength we reported in the original Figure 3F, both samples overlap and form a linear trend. Second, when running the correlational analyses between coupling strength and task proficiency as well as learning curve separately for each age group, we found that the correlations point in the same direction (Figure R3). Adolescents with higher coupling strength show better task proficiency (Figure R3A, rhos = 0.66, p = 0.005). This effect was also present when using robust regression (b = 109.97, t(15) = 3.13, rho = 0.63, p = 0.007). Like adolescents, adults with higher coupling strength at C4 displayed better task proficiency after sleep (Figure R3B, rhos = 0.39, p = 0.053). This relationship was stronger when using robust regression (b = 151.36, t(23) = 3.17, rho = 0.56, p = 0.004). For learning curves, we found the expected negative correlation at C4 for adolescents (Figure R3C, rhos = -0.57, p = 0.020) and adults (Figure R3D, rhos = -0.44, p = 0.031). Results were comparable when using robust regression (adolescents: b = -59.58, t(15) = -2.94, rho = -0.60, p = 0.010; adults: b = -21.99, t(23) = -1.71, rho = -0.37, p = 0.101).

      Taken together, these results demonstrate that adolescents and adults show effects in the same direction at the same electrode, making it highly unlikely that our results arose by chance or that our initial correlation analyses were driven by just one group.

      Additionally, we already controlled for age in our original analyses using partial correlations (also refer to our answer to issue #6). Hence, these analyses provide further support that it is viable to collapse the data across both age groups even though they differ in coupling strength.
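
      The age control mentioned above relies on partial correlations. One common way to obtain a partial Spearman correlation, sketched below as an assumption rather than as the authors' exact implementation, is to rank-transform all variables, regress the covariate ranks out of both variables of interest, and correlate the residuals. The function name and simulated data are hypothetical, and a p-value (which would use n - 2 - k degrees of freedom for k covariates) is not computed here.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr


def partial_spearman(x, y, covariates):
    """Spearman correlation of x and y after removing the linear effect of
    one or more rank-transformed covariates. covariates: (n,) or (n, k)."""
    rx, ry = rankdata(x), rankdata(y)
    cov = np.asarray(covariates, dtype=float)
    if cov.ndim == 1:
        cov = cov[:, None]
    # Design matrix: intercept plus ranked covariates
    rc = np.column_stack([np.ones(len(rx))] + [rankdata(c) for c in cov.T])
    # Residualize both ranked variables on the ranked covariates
    res_x = rx - rc @ np.linalg.lstsq(rc, rx, rcond=None)[0]
    res_y = ry - rc @ np.linalg.lstsq(rc, ry, rcond=None)[0]
    return np.corrcoef(res_x, res_y)[0, 1]


# x and y are related only through a shared dependence on age
rng = np.random.default_rng(3)
age = rng.uniform(12, 30, size=200)
x = age + 3.0 * rng.normal(size=200)
y = age + 3.0 * rng.normal(size=200)
raw_rho = spearmanr(x, y)[0]
partial_rho = partial_spearman(x, y, age)
```

      Here the raw rank correlation is large because both variables follow age, while the partial correlation shrinks toward zero once age is removed; a relationship that survives this step cannot be explained by age alone.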

      2. Different EEG systems

      The reviewers also raise the question of whether our analyses might be affected by the different EEG systems we used to record our data. This is an important concern, especially considering that cross-frequency coupling analyses can be severely confounded by differences in signal properties (Aru et al, 2015). In our sample, the strongest factor influencing signal properties is most likely age, given the broadband power differences in the power spectrum that we found between the groups (original Figure 3A). Importantly, we also found a similar systematic power difference in our longitudinal study, which used the same ambulatory EEG system for both recordings (Hahn et al, 2020). This is in line with numerous other studies demonstrating age-related EEG power changes in the broadband as well as in the SO and sleep spindle frequency ranges (Campbell & Feinberg, 2016; Feinberg & Campbell, 2013; Helfrich et al, 2018; Kurth et al, 2010; Muehlroth et al, 2019; Muehlroth & Werkle-Bergner, 2020; Purcell et al, 2017). Therefore, we already had to take differences in signal properties into account for our cross-frequency analyses, regardless of whether the underlying cause is an age difference or different signal-to-noise ratios of different EEG systems.

      To mitigate confounds in the signal, we used a data-driven and individualized approach, detecting SO and sleep spindle events based on individualized frequency bands and a 75th-percentile amplitude criterion relative to the underlying signal. Additionally, we z-normalized all spindle events prior to the cross-frequency coupling analyses (Figure R3E). Using cluster-based random permutation testing, we found no amplitude differences around the spindle peak (point of SO-phase readout) between adolescents, who were recorded with an ambulatory amplifier system (alphatrace), and adults, who were recorded with a stationary amplifier system (neuroscan). This was also the case for the SO-filtered (< 2 Hz) signal (Figure R3E, inset). Critically, the significant differences in amplitude from -1.4 to -0.8 s (p = 0.023, d = -0.73) and 0.4 to 1.5 s (p < 0.001, d = 1.1) are not caused by age-related differences in power or by different EEG systems, but instead by the increased coupling strength (i.e. higher coupling precision of spindles to SOs) in adults, which gives rise to a more pronounced SO waveform when averaging across spindle-peak-locked epochs.
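
      The normalization and group comparison described above can be sketched as follows: each spindle-locked epoch is z-scored over time, and a cluster-based permutation test compares the two groups sample by sample, forming clusters from temporally contiguous samples whose independent-samples t-value exceeds a two-sided threshold. Per-epoch z-scoring, summed t as the cluster statistic, the function names and the simulated data are assumptions for illustration; the response does not specify these details for this comparison.

```python
import numpy as np
from scipy.stats import ttest_ind, t as t_dist


def znorm_epochs(epochs):
    """z-score each epoch (row) over time."""
    mu = epochs.mean(axis=1, keepdims=True)
    sd = epochs.std(axis=1, keepdims=True)
    return (epochs - mu) / sd


def time_clusters(tvals, thresh):
    """Contiguous same-sign runs with |t| > thresh -> (start, stop, summed_t)."""
    out, i, n = [], 0, len(tvals)
    while i < n:
        if abs(tvals[i]) <= thresh:
            i += 1
            continue
        sign, j = np.sign(tvals[i]), i
        while j + 1 < n and abs(tvals[j + 1]) > thresh and np.sign(tvals[j + 1]) == sign:
            j += 1
        out.append((i, j + 1, float(tvals[i:j + 1].sum())))
        i = j + 1
    return out


def cluster_permutation_ttest(group_a, group_b, n_perm=1000, alpha=0.05, seed=0):
    """group_a, group_b: (n_subjects, n_times) subject-level averages.
    Returns (start, stop, cluster_mass, p_value) per observed cluster."""
    rng = np.random.default_rng(seed)
    data = np.vstack([group_a, group_b])
    n_a = len(group_a)
    thresh = t_dist.ppf(1 - alpha / 2, df=len(data) - 2)

    def t_series(d):
        return ttest_ind(d[:n_a], d[n_a:], axis=0).statistic

    observed = time_clusters(t_series(data), thresh)
    null_max = np.empty(n_perm)
    for k in range(n_perm):
        shuffled = data[rng.permutation(len(data))]  # swap group labels
        null_max[k] = max((abs(c[2]) for c in time_clusters(t_series(shuffled), thresh)),
                          default=0.0)
    return [(s, e, m, (np.sum(null_max >= abs(m)) + 1) / (n_perm + 1))
            for s, e, m in observed]


# Simulated subject averages: group B differs between samples 40 and 60
rng = np.random.default_rng(4)
group_a = rng.normal(size=(20, 100))
group_b = rng.normal(size=(20, 100))
group_b[:, 40:60] += 2.0
epochs = znorm_epochs(rng.normal(size=(5, 100)))
result = cluster_permutation_ttest(group_a, group_b, n_perm=200)
```

      Because only the largest cluster of each permutation enters the null distribution, isolated supra-threshold samples elsewhere in the epoch do not inflate the false-positive rate.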

      Consequently, our analysis pipeline already controlled for possible differences in signal properties introduced by the different amplifier systems. Nonetheless, we also wanted to directly compare the signal-to-noise ratio of the ambulatory and stationary amplifier systems. However, we only obtained data from both amplifier systems in the adult sleep first group, because we recorded EEG during the juggling learning phase with the ambulatory system in addition to the PSG with the stationary system. First, we computed the power spectra in the 1 to 49 Hz frequency range in 10-second segments during the juggling learning phase (ambulatory) and during quiet wakefulness (stationary) for every subject in the adult sleep first group. Next, we computed the signal-to-noise ratio (mean/standard deviation) of the power spectra per frequency across all segments. We only found a small negative cluster from 21.9 to 22.5 Hz (p = 0.042, d = 0.53; Figure R3F), which did not pertain to our frequency bands of interest. Critically, the signal-to-noise ratios of both amplifiers converged in the upper frequency bands approaching the noise floor, strongly supporting the notion that both systems in fact provided highly comparable estimates.
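
      The signal-to-noise measure described above (mean divided by standard deviation of spectral power across 10-second segments, per frequency) can be sketched as below. Welch's method with 2-second windows, the sampling rate and the simulated signal are assumptions; the response does not state how the individual spectra were estimated.

```python
import numpy as np
from scipy.signal import welch


def spectral_snr(signal, fs, seg_sec=10.0, fmin=1.0, fmax=49.0):
    """Per-frequency SNR = mean / SD of power across non-overlapping segments."""
    seg_len = int(seg_sec * fs)
    n_seg = len(signal) // seg_len
    segments = signal[: n_seg * seg_len].reshape(n_seg, seg_len)
    # One power spectrum per segment (Welch with 2-s windows: an assumed choice)
    freqs, psd = welch(segments, fs=fs, nperseg=int(2 * fs), axis=-1)
    keep = (freqs >= fmin) & (freqs <= fmax)
    power = psd[:, keep]
    return freqs[keep], power.mean(axis=0) / power.std(axis=0)


# Simulated 2 minutes of noise at 250 Hz (hypothetical recording)
rng = np.random.default_rng(5)
fs = 250.0
freqs, snr = spectral_snr(rng.normal(size=int(120 * fs)), fs)
```

      A between-system comparison would then contrast such per-frequency curves across subjects, for example with a cluster-based permutation test over frequencies.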

      In conclusion, both age groups display highly similar effects in the same direction when correlating coupling strength with behavior. Further, after individualizing and normalizing the analytical signal, we found no differences in signal properties that would confound the cross-frequency analysis. Lastly, we did not find systematic differences in signal-to-noise ratio between the different EEG systems. Thus, we believe it is justified to collapse the data across all participants for the correlational analyses, as this combines both the developmental aspect of enhanced coupling precision from adolescence to adulthood and its behavioral relevance for motor learning, which we deem a critical research advance over our previous study.

      Figure R3

      (A) Cluster-corrected correlations (right) between individual coupling strength and overnight task proficiency change (post – pre retention) for adolescents of the sleep-first group (left, Spearman correlation at C4, uncorrected). Asterisks indicate cluster-corrected two-sided p < 0.05. Grey-shaded area indicates 95% confidence intervals of the robust trend line. Participants with more precise SO-spindle coordination show improved task proficiency after sleep. (B) Cluster-corrected correlation of coupling strength and overnight task proficiency change for adults. Same conventions as in (A). Similar trend of higher coupling strength predicting better task proficiency after sleep. (C) Cluster-corrected correlation of coupling strength and overnight learning curve change for adolescents. Same conventions as in (A). Higher coupling strength related to a flatter learning curve after sleep. (D) Cluster-corrected correlation of coupling strength and overnight learning curve change for adults. Same conventions as in (A). Higher coupling strength related to a flatter learning curve after sleep. (E) Spindle-peak-locked epoch (NREM3, co-occurrence corrected) grand averages (mean ± SEM) for adolescents (red) and adults (black). Inset depicts the corresponding SO-filtered (2 Hz lowpass) signal. Black lines indicate significant clusters. Note that we found no amplitude difference around the spindle peak after normalization; significant differences are due to more precise SO-spindle coupling in adults. Spindle frequency is blurred due to individualized spindle detection. (F) Signal-to-noise ratio for the stationary EEG amplifier (green) during quiet wakefulness and for the ambulatory EEG amplifier (purple) during juggling training. Grey-shaded area denotes cluster-corrected p < 0.05. Note that the signal-to-noise ratio converges in the higher frequency ranges.

      We have now added Figure R3E as Figure 3B to the revised version of the manuscript to demonstrate that there were no systematic differences in the analytical signal between the two age groups due to the expected age-related power differences or EEG systems. Specifically, we now state in the results section (page 13 – 14, lines 282 – 294):

      "We assessed the cross-frequency coupling based on z-normalized spindle epochs (Figure 3B) to alleviate potential power differences due to age (Figure 3 – figure supplement 1A) or different EEG-amplifier systems that could potentially confound our analyses (Aru et al, 2015). Importantly, we found no amplitude differences around the spindle peak (point of SO-phase readout) between adolescents and adults using cluster-based random permutation testing (Figure 3B), indicating an unbiased analytical signal. This was also the case for the SO-filtered (< 2 Hz) signal (Figure 3B, inset). Critically, the significant differences in amplitude from -1.4 to -0.8 s (p = 0.023, d = -0.73) and 0.4 to 1.5 s (p < 0.001, d = 1.1) are not caused by age-related differences in power or different EEG-systems but instead by the increased coupling strength (i.e. higher coupling precision of spindles to SOs) in adults giving rise to a more pronounced SO-wave shape when averaging across spindle peak locked epochs."

      Further, we added the correlational analyses that we computed separately for the age groups (Figure R3A-D) to the revised manuscript (Figure 3 – figure supplement 2CD) as they further substantiate our claims about the relationship between SO-spindle coupling and gross-motor learning.

      We now refer to these analyses in the results section (page 16, lines 338 – 343):

      "Critically, when computing the correlational analyses separately for adolescents and adults, we identified highly similar effects at electrode C4 for task proficiency (Figure 3 – figure supplement 2C) and learning curve (Figure 3 – figure supplement 2D) in each group. These complementary results demonstrate that coupling strength predicts gross-motor learning dynamics in both adolescents and adults, and further show that this effect is not solely driven by one group."

      2) The authors might want to explicitly show that the reported correlations (with regards to both learning curve and task proficiency change) are not driven by any outliers.

      We thank the reviewers for their suggestion. We agree that, when inspecting the scatter plots, it appears that the correlations could be strongly influenced by two outliers in the adult group. Because this is an important matter, we recalculated all previously reported correlations without the two outliers (Figure R4, left column) and, following the reviewer's suggestion, also computed robust regression (Figure R4, right column). We found no substantial deviation from our original results.

      In more detail, an increase in task proficiency was associated with a flattening of the learning curve both after removing outliers (Figure R4A, rhos = -0.70, p < 0.001) and when applying robust regression analysis (Figure R4B, b = -0.30, t(67) = -10.89, rho = -0.80, p < 0.001). Likewise, higher coupling strength still predicted better task proficiency (mean rho = 0.35, p = 0.029, cluster-corrected) and flatter learning curves after sleep (rho = -0.44, p = 0.047, cluster-corrected) when removing the outliers (Figure R4CE) and when calculating robust regression (Figure R4DF, task proficiency: b = 82.32, t(40) = 3.12, rho = 0.45, p = 0.003; learning curve: b = -26.84, t(40) = -2.96, rho = -0.43, p = 0.005). Furthermore, in our original manuscript we had already calculated Spearman rank correlations and cluster-corrected Spearman rank correlations to mitigate the impact of outliers, even though Pearson correlations are more widely used in the field. Therefore, we continue to report Spearman rank correlations for single electrodes instead of robust correlations, as this is more consistent with the cluster-corrected correlation analyses.

      We now use robust trend lines instead of linear trend lines in our scatter plots. Further, we added the correlations without outliers (Figure R4ACE) to the supplementary material as Figure 2 – figure supplement 1D and Figure 3 – figure supplement 2FG. These additional analyses are now reported in the results section of the revised manuscript (page 9, lines 186 – 191):

      "[…] we confirmed a strong negative correlation between the change (post retention values – pre retention values) in task proficiency and the change in learning curve after the retention interval (Figure 2F; rhos = -0.71, p < 0.001), which also remained strong after outlier removal (Figure 2 – figure supplement 1D). This result indicates that participants who consolidate their juggling performance after a retention interval show slower gains in performance."

      And (page 16, lines 343 – 346):

      "[…] Furthermore, our results remained consistent when including coupled spindle events in NREM2 (Figure 3 – figure supplement 2E) and after outlier removal (Figure 3 – figure supplement 2FG)."

      Furthermore, in the Methods section we now state that we specifically used Spearman rank correlations to mitigate the impact of outliers (page 35, lines 808 – 813):

      "For correlational analyses we utilized Spearman rank correlations (rhos; Figure 2F & Figure 3DE) to mitigate the impact of possible outliers as well as cluster-corrected Spearman rank correlations by transforming the correlation coefficients to t-values (p < 0.05) and clustering in the space domain (Figure 3DE). Linear trend lines were calculated using robust regression."

      Figure R4

      (A) Spearman rank correlation between task proficiency change and learning curve change collapsed across adolescents (red dots) and adults (black diamonds) after removing two outlier subjects in the adult age group. Grey-shaded area indicates 95% confidence intervals of the robust trend line. (B) Robust regression of task proficiency change and learning curve change of the original sample. (C) Cluster-corrected correlations (right) between individual coupling strength and overnight task proficiency change (post – pre retention) after outlier removal (left, Spearman correlation at C4, uncorrected). Asterisks indicate cluster-corrected two-sided p < 0.05. (D) Robust regression of coupling strength at C4 and task proficiency of the original sample. (E) Same conventions as in (C) but for overnight learning curve change. (F) Same conventions as in (D) but for overnight learning curve change.

      3) The sleep data of all participants (thus from both the sleep first and wake first groups) were used to determine the features of SO-spindle coupling in adolescents and adults. Were there any differences between groups (sleep first vs. wake first)? This might be interesting in general, but especially because only data of the sleep first group entered the subsequent correlational analyses.

      We thank the reviewers for their remark. We agree that adding information about possible differences between the sleep first and wake first groups allows for a more comprehensive assessment of the reported data. In the original manuscript, we did not explain our reasoning for including only the sleep first groups in the correlation analyses clearly enough. Unfortunately, we can only report these data for the adolescents in our sample, because we did not record polysomnography (PSG) for the adult wake first group. This is also one of the two reasons why we focused on the sleep first groups for our correlational analyses.

      Adolescents in the sleep first group did not differ from adolescents in the wake first group in terms of sleep architecture (except REM (%), which did not correlate with behavior [task proficiency: rho = -0.17, p = 0.28; learning curve: rho = -0.02, p = 0.90]) or SO and sleep spindle event descriptive measures (see Table R2). Importantly, we found no differences in coupling strength between the two groups (Figure R2A).

      Table R2. Summary of sleep architecture and SO/spindle event descriptive measures (at electrode C4) of adolescents in the sleep first and wake first groups (mean ± standard deviation). Independent t-tests were used for comparisons.

      The second reason why we focused our analyses on the sleep first group was that adolescents in the wake first group had higher task proficiency after the sleep retention interval than the sleep first group (Figure R2B; t(23) = -2.24, p = 0.034). This difference in performance is directly explained by the additional juggling test that the wake first group performed at the time point of their learning night, which should be considered additional training. Therefore, we excluded the wake first group from our correlational analyses, because the sleep first and wake first groups are not comparable in the amount of juggling training they had received by the night in which we assessed SO-spindle coupling strength.

      Figure R2

      (A) Comparison of SO-spindle coupling strength in the adolescent sleep first (blue) and wake first (green) group using cluster-based random permutation testing (Monte-Carlo method, cluster alpha 0.05, max size criterion, 1000 iterations, critical alpha level 0.05, two-sided). Left: exemplary depiction of coupling strength at electrode C4 (mean ± SEM). Right: z-transformed t-values plotted for all electrodes obtained from the cluster test. No significant clusters emerged. (B) Comparison of task proficiency between sleep first and wake first group after the sleep retention interval (mean ± SEM). Adolescents in the wake first group had higher task proficiency given the additional juggling performance test, which also reflects additional training.

      These additional analyses (Figure R2) and the summary statistics of sleep architecture and SO/spindle event descriptives of adolescents in the sleep first and wake first groups (Table R2) are now reported in the revised version of the manuscript as Figure 3 – figure supplement 2AB and Supplementary file – table 7. We now explicitly explain our rationale for considering only participants in the sleep first group in our correlational analyses in the results section (page 6, lines 101 – 105):

      "Polysomnography (PSG) was recorded during an adaptation night and during the respective sleep retention interval (i.e. learning night) except for the adult wake-first group (for sleep architecture descriptive parameters of the adaptation night and learning night as well as for adolescents and adults see Supplementary file – table 1 & 2)"

      And (page 15, lines 311 – 320):

      "[…] Furthermore, given that we only recorded polysomnography for the adults in the sleep first group and that adolescents in the wake first group showed enhanced task proficiency at the time point of the sleep retention interval due to additional training (Figure 3 – figure supplement 2A), we only considered adolescents and adults of the sleep-first group to ensure a similar level of juggling experience (for summary statistics of sleep architecture and SO and spindle events of subjects that entered the correlational analyses see Supplementary file – table 6). Notably, we found no differences in electrophysiological parameters (i.e. coupling strength, event detection) between the adolescents of the wake first and sleep first group (Figure 3 – figure supplement 2B & Supplementary file – table 7)."

      4) To allow a more comprehensive assessment of the underlying data, information regarding general sleep descriptives (minutes and percentage of time spent in different sleep stages, overall sleep time, etc.) as well as SOs, spindles and coupled events (e.g. number, density, etc.) would be needed.

      We agree with the reviewers that additional information about sleep architecture and SO as well as sleep spindle characteristics is needed for a more comprehensive assessment of our data. We have now added summary tables of sleep architecture and SO/spindle event descriptive measures for the whole sample (Table R4) and for the sleep first groups that we used for our correlational analyses (Table R5) to the supplementary material of the updated manuscript. It is important to note that, due to the longer sleep opportunity we provided to adolescents to accommodate the overall higher sleep need of younger participants, adolescents and adults differed in most general sleep architecture markers as well as in SO and sleep spindle descriptive measures. In addition, changes in sleep architecture are prominent during the maturational phase from adolescence to adulthood, which might introduce additional variance between the two age groups.

      Table R4. Summary of sleep architecture and SO/spindle event descriptive measures (at electrode C4) of adolescents and adults across the whole sample (mean ± standard deviation) in the learning night. Independent t-tests were used for comparisons.

      Table R5. Summary of sleep architecture and SO/spindle event descriptive measures (at electrode C4) of adolescents and adults in the sleep first group (mean ± standard deviation) in the learning night. Independent t-tests were used for comparisons.

      To ensure that our correlational analyses are not driven by these systematic differences between the two age groups, we used cluster-corrected partial correlations to control for sleep architecture markers (Figure R7) and SO/spindle descriptive measures (Figure R8A). Critically, controlling for these possible confounders did not change the pattern of our initial correlational analyses of coupling strength with task proficiency and learning curve. Additionally, we controlled for differences in spindle event number using a bootstrapped resampling approach: we randomly drew 200 spindle events in each of 100 iterations and recalculated the coupling strength for each subject. Resampled and original coupling strength values were almost perfectly correlated, indicating that differences in event number are unlikely to affect coupling strength as long as at least 200 events are available (Figure R8B). Combined, these analyses demonstrate that our correlations between coupling strength and behavior are not driven by the reported differences in sleep architecture and SO/spindle descriptive measures.
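      The resampling control can be sketched as follows. This excerpt does not define coupling strength, so the sketch assumes, as is common in this literature, that it is the resultant vector length of the SO phases at which spindle amplitude peaks. Whether events were drawn with or without replacement, and whether the 100 iterations were averaged, are also assumptions and are exposed as parameters. Function names are hypothetical; this is not the authors' pipeline.

```python
import numpy as np
from scipy.stats import spearmanr


def coupling_strength(phases):
    """Resultant vector length (0..1) of SO phases (radians) at spindle peaks."""
    return float(np.abs(np.mean(np.exp(1j * np.asarray(phases, dtype=float)))))


def resampled_coupling_strength(phases, n_events=200, n_iter=100,
                                replace=False, seed=None):
    """Mean coupling strength over n_iter random draws of n_events spindles.

    Assumption: draws are averaged across iterations; `replace` switches
    between subsampling (False) and bootstrap-style draws (True).
    """
    phases = np.asarray(phases, dtype=float)
    if not replace and phases.size < n_events:
        raise ValueError("fewer spindle events than the subsample size")
    rng = np.random.default_rng(seed)
    draws = [coupling_strength(rng.choice(phases, size=n_events, replace=replace))
             for _ in range(n_iter)]
    return float(np.mean(draws))


if __name__ == "__main__":
    # Simulated subjects: von Mises phases with different concentrations
    # and event counts (illustrative data, not the study's).
    rng = np.random.default_rng(0)
    settings = [(0.3, 250), (0.8, 900), (1.5, 400), (3.0, 600), (5.0, 300)]
    subjects = [rng.vonmises(0.0, kappa, size=n) for kappa, n in settings]
    original = [coupling_strength(p) for p in subjects]
    resampled = [resampled_coupling_strength(p, seed=1) for p in subjects]
    rho, p = spearmanr(original, resampled)
    print(f"Spearman rho between original and resampled values: {rho:.2f}")
```

      Because the resultant vector length of a small sample is biased upward, resampled values are only comparable across subjects when every subject contributes the same fixed number of events per draw, which is the point of drawing exactly 200.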

      Figure R7

      Summary of cluster-corrected partial correlations of coupling strength with task proficiency (left) and learning curve (right) controlling for possible confounding factors. Asterisks indicate location of the detected cluster. The pattern of initial results remained highly stable.

      Figure R8

      (A) Summary of cluster-corrected partial correlations of coupling strength with task proficiency (left) and learning curve (right) controlling for SO/spindle descriptive measures at critical electrode C4. Asterisks indicate location of the detected cluster. The pattern of initial results remained highly stable. (B) Spearman correlation between resampled coupling strength (N = 200, 100 iterations) and original observation of coupling strength for adolescents (red circles) and adults (black diamonds), indicating that coupling strength is not influenced by spindle event number if at least 200 events are present. Grey-shaded area indicates 95% confidence intervals of the robust trend line.

      We now provide general sleep descriptives (Table R4 & R5) in the revised version of the manuscript as Supplementary file – table 2 & table 6. These data are referred to in the results section (page 6, lines 101 – 105):

      "Polysomnography (PSG) was recorded during an adaptation night and during the respective sleep retention interval (i.e. learning night) except for the adult wake-first group (for sleep architecture descriptive parameters of the adaptation night and learning night as well as for adolescents and adults see Supplementary file – table 1 & 2)."

      And (page 15, lines 311 – 318):

      "Furthermore, given that we only recorded polysomnography for the adults in the sleep first group and that adolescents in the wake first group showed enhanced task proficiency at the time point of the sleep retention interval due to additional training (Figure 3 – figure supplement 2A), we only considered adolescents and adults of the sleep-first group to ensure a similar level of juggling experience (for summary statistics of sleep architecture and SO and spindle events of subjects that entered the correlational analyses see Supplementary file – table 6)."

      The additional control analyses (Figure R7 & R8) are also now added to the revised manuscript as Figure 3 – figure supplement 3 & 4 in the results section (page 16, lines 356 – 360):

      "For a summary of the reported cluster-corrected partial correlations as well as analyses controlling for differences in sleep architecture see Figure 3 – figure supplement 3. Further, we also confirmed that our correlations are not influenced by individual differences in SO and spindle event parameters (Figure 3 – figure supplement 4)."

      5) The authors used a partial correlation to rule out that age drove the relationship between coupling strength, learning curve and task proficiency. This analysis seems to have been done specifically for electrode C4, after it had already been established that coupling strength at C4 correlates with changes in learning curve and task proficiency. I think the claim that the results were not driven by age as a confounding factor would be stronger if the authors used a cluster-corrected partial correlation in the first place (as in the main analysis).

      The reviewers are correct that we initially conducted the partial correlation only for electrode C4. Following the reviewers' suggestion, we have now additionally computed cluster-corrected partial correlations analogous to our main analysis. As in our original analyses, when controlling for age we found a significant positive central cluster (Figure R6A, mean rho = 0.40, p = 0.017), showing that higher coupling strength was related to better task proficiency after sleep, and a significant negative cluster-corrected correlation at C4 (Figure R6B, rho = -0.47, p = 0.049), showing that higher coupling strength was related to flatter learning curves after sleep.

      Figure R6

      (A) Cluster-corrected partial correlation of individual coupling strength in the learning night and overnight change in task proficiency (post – pre retention) collapsed across adolescents and adults, controlling for age. Asterisks indicate cluster-corrected two-sided p < 0.05. A significant cluster comprising electrodes Cz and C4 emerged, similar to the original analysis (Figure 4A). (B) Same conventions as in A. As in the original analysis (Figure 4B), a negative correlation between coupling strength at C4 and learning curve change survived cluster-corrected partial correlations when controlling for age.

      We now always report cluster-corrected partial correlations when controlling for possible confounding variables in the updated version of the manuscript (also see answer to issue #7). A summary of all computed partial correlations including Figure R6 can now be found as Figure 3 – figure supplement 3 & 4 in the revised manuscript.

      Specifically, we now state in the results section (page 16 – 17, lines 347 – 360):

      "To rule out age as a confounding factor that could drive the relationship between coupling strength, learning curve and task proficiency in the mixed sample, we used cluster-corrected partial correlations to confirm their independence from age differences (task proficiency: mean rho = 0.40, p = 0.017; learning curve: rho = -0.47, p = 0.049). Additionally, given that we found that juggling performance could be subject to circadian modulation, we controlled for individual differences in alertness between subjects due to having just slept. We partialed out the mean PVT reaction time before the juggling performance test after sleep from the original analyses and found that our results remained stable (task proficiency: mean rho = 0.37, p = 0.025; learning curve: rho = -0.49, p = 0.040). For a summary of the reported cluster-corrected partial correlations as well as analyses controlling for differences in sleep architecture see Figure 3 – figure supplement 3. Further, we also confirmed that our correlations are not influenced by individual differences in SO and spindle event parameters (Figure 3 – figure supplement 4)."

      And in the methods section (page 35, lines 813 – 814):

      "To control for possible confounding factors we computed cluster-corrected partial rank correlations (Figure 3 – figure supplement 3 and 4)."
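      For a single electrode, a partial rank correlation can be sketched as below: all variables are rank-transformed, the covariate ranks are regressed out of both variables of interest, and the residuals are correlated. This is one standard way to compute a partial Spearman correlation and is an assumption about the authors' implementation. The cluster-based correction across electrodes (for example, the permutation framework of Maris & Oostenveld, 2007, listed in the references) would wrap this per-electrode statistic and is not shown. Function and argument names are hypothetical.

```python
import numpy as np
from scipy import stats


def partial_spearman(x, y, covariates):
    """Partial Spearman correlation of x and y controlling for covariates.

    covariates: array of shape (n,) or (n, k). Returns (rho, two-sided p).
    """
    rx = stats.rankdata(x)
    ry = stats.rankdata(y)
    cov = np.asarray(covariates, dtype=float)
    if cov.ndim == 1:
        cov = cov[:, None]
    n, k = cov.shape
    # Design matrix: intercept plus the rank-transformed covariates.
    design = np.column_stack(
        [np.ones(n)] + [stats.rankdata(cov[:, j]) for j in range(k)])
    # Residualize the ranks of x and y on the covariate ranks.
    res_x = rx - design @ np.linalg.lstsq(design, rx, rcond=None)[0]
    res_y = ry - design @ np.linalg.lstsq(design, ry, rcond=None)[0]
    rho = float(np.corrcoef(res_x, res_y)[0, 1])
    # t-test with n - 2 - k degrees of freedom (one df lost per covariate).
    df = n - 2 - k
    t_stat = rho * np.sqrt(df / (1.0 - rho ** 2))
    p = float(2.0 * stats.t.sf(abs(t_stat), df))
    return rho, p
```

      With hypothetical per-subject arrays, a call such as `partial_spearman(coupling_c4, proficiency_change, age)` would give the age-controlled correlation at C4; passing an (n, k) array controls for several confounders at once, such as age and PVT reaction time.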

      References

      Aru, J., Aru, J., Priesemann, V., Wibral, M., Lana, L., Pipa, G., Singer, W. & Vicente, R. (2015) Untangling cross-frequency coupling in neuroscience. Curr Opin Neurobiol, 31, 51-61.

      Bothe, K., Hirschauer, F., Wiesinger, H. P., Edfelder, J., Gruber, G., Birklbauer, J. & Hoedlmoser, K. (2019) The impact of sleep on complex gross-motor adaptation in adolescents. Journal of Sleep Research, 28(4).

      Bothe, K., Hirschauer, F., Wiesinger, H. P., Edfelder, J. M., Gruber, G., Hoedlmoser, K. & Birklbauer, J. (2020) Gross motor adaptation benefits from sleep after training. J Sleep Res, 29(5), e12961.

      Campbell, I. G. & Feinberg, I. (2016) Maturational Patterns of Sigma Frequency Power Across Childhood and Adolescence: A Longitudinal Study. Sleep, 39(1), 193-201.

      Dayan, E. & Cohen, L. G. (2011) Neuroplasticity subserving motor skill learning. Neuron, 72(3), 443-54.

      De Gennaro, L. & Ferrara, M. (2003) Sleep spindles: an overview. Sleep Med Rev, 7(5), 423-40.

      De Gennaro, L., Ferrara, M., Vecchio, F., Curcio, G. & Bertini, M. (2005) An electroencephalographic fingerprint of human sleep. Neuroimage, 26(1), 114-22.

      Dinges, D. F., Pack, F., Williams, K., Gillen, K. A., Powell, J. W., Ott, G. E., Aptowicz, C. & Pack, A. I. (1997) Cumulative sleepiness, mood disturbance, and psychomotor vigilance performance decrements during a week of sleep restricted to 4-5 hours per night. Sleep, 20(4), 267-77.

      Dinges, D. F. & Powell, J. W. (1985) Microcomputer Analyses of Performance on a Portable, Simple Visual RT Task during Sustained Operations. Behavior Research Methods Instruments & Computers, 17(6), 652-655.

      Eichenlaub, J. B., Biswal, S., Peled, N., Rivilis, N., Golby, A. J., Lee, J. W., Westover, M. B., Halgren, E. & Cash, S. S. (2020) Reactivation of Motor-Related Gamma Activity in Human NREM Sleep. Front Neurosci, 14, 449.

      Feinberg, I. & Campbell, I. G. (2013) Longitudinal sleep EEG trajectories indicate complex patterns of adolescent brain maturation. American Journal of Physiology - Regulatory, Integrative and Comparative Physiology, 304(4), R296-303.

      Hahn, M., Heib, D., Schabus, M., Hoedlmoser, K. & Helfrich, R. F. (2020) Slow oscillation-spindle coupling predicts enhanced memory formation from childhood to adolescence. Elife, 9.

      Helfrich, R. F., Lendner, J. D. & Knight, R. T. (2021) Aperiodic sleep networks promote memory consolidation. Trends Cogn Sci.

      Helfrich, R. F., Lendner, J. D., Mander, B. A., Guillen, H., Paff, M., Mnatsakanyan, L., Vadera, S., Walker, M. P., Lin, J. J. & Knight, R. T. (2019) Bidirectional prefrontal-hippocampal dynamics organize information transfer during sleep in humans. Nature Communications, 10(1), 3572.

      Helfrich, R. F., Mander, B. A., Jagust, W. J., Knight, R. T. & Walker, M. P. (2018) Old Brains Come Uncoupled in Sleep: Slow Wave-Spindle Synchrony, Brain Atrophy, and Forgetting. Neuron, 97(1), 221-230 e4.

      Killgore, W. D. (2010) Effects of sleep deprivation on cognition. Prog Brain Res, 185, 105-29.

      Kurth, S., Jenni, O. G., Riedner, B. A., Tononi, G., Carskadon, M. A. & Huber, R. (2010) Characteristics of sleep slow waves in children and adolescents. Sleep, 33(4), 475-80.

      Maris, E. & Oostenveld, R. (2007) Nonparametric statistical testing of EEG- and MEG-data. J Neurosci Methods, 164(1), 177-90.

      Muehlroth, B. E., Sander, M. C., Fandakova, Y., Grandy, T. H., Rasch, B., Shing, Y. L. & Werkle-Bergner, M. (2019) Precise Slow Oscillation-Spindle Coupling Promotes Memory Consolidation in Younger and Older Adults. Sci Rep, 9(1), 1940.

      Muehlroth, B. E. & Werkle-Bergner, M. (2020) Understanding the interplay of sleep and aging: Methodological challenges. Psychophysiology, 57(3), e13523.

      Niethard, N., Ngo, H. V. V., Ehrlich, I. & Born, J. (2018) Cortical circuit activity underlying sleep slow oscillations and spindles. Proceedings of the National Academy of Sciences of the United States of America, 115(39), E9220-E9229.

      Purcell, S. M., Manoach, D. S., Demanuele, C., Cade, B. E., Mariani, S., Cox, R., Panagiotaropoulou, G., Saxena, R., Pan, J. Q., Smoller, J. W., Redline, S. & Stickgold, R. (2017) Characterizing sleep spindles in 11,630 individuals from the National Sleep Research Resource. Nature Communications, 8, 15930.

      Van Dongen, H. P., Maislin, G., Mullington, J. M. & Dinges, D. F. (2003) The cumulative cost of additional wakefulness: dose-response effects on neurobehavioral functions and sleep physiology from chronic sleep restriction and total sleep deprivation. Sleep, 26(2), 117-26.

      Wilhelm, I., Metzkow-Meszaros, M., Knapp, S. & Born, J. (2012) Sleep-dependent consolidation of procedural motor memories in children and adults: the pre-sleep level of performance matters. Developmental Science, 15(4), 506-15.

      Winer, J. R., Mander, B. A., Helfrich, R. F., Maass, A., Harrison, T. M., Baker, S. L., Knight, R. T., Jagust, W. J. & Walker, M. P. (2019) Sleep as a potential biomarker of tau and beta-amyloid burden in the human brain. J Neurosci.

    1. Author Response:

      Reviewer #1:

      Maimon-Mor et al. examined the control of reaching movements in one-handers, who were born with a partial arm, and amputees, who lost their arm in adulthood. The authors hypothesized that since one-handers started using their artificial arm earlier in life than amputees, they would exhibit better motor control, as measured by point-to-point reaching accuracy. Surprisingly, they found the opposite: the reaching accuracy of one-handers is worse than that of amputees (and of controls using their non-dominant hand). This deficit in motor control was reflected in increased motor noise rather than consistent motor biases.

      Strengths:

      • I found the paper in general very well and clearly written.
      • The authors provide detailed analyses to examine various possible factors underlying deficits in reaching movements in one-handers and amputees, including age at which participants first used an artificial arm, current usage of the arm, performance in hand localization tasks, and statistical methods that control for potential confounding factors.
      • The results showing that one-handers, who start using the artificial arm at an early age, have worse motor control than amputees, who typically start using the arm during adulthood, are surprising and interesting. Also intriguing is the finding that reaching accuracy is negatively correlated with the duration of limbless experience in both groups. These results suggest that there is a plasticity window that is not anchored to a certain age, but rather reflects some interference (perhaps) from the time spent without the use of an artificial arm. In one-handers these two time intervals are confounded, but the amputees allow them to be separated. I think that the results have implications for understanding the plasticity involved in acquiring skills for using artificial limbs.

      Weaknesses:

      • While I found that one of the main conclusions of the paper is that the main factor related to increased motor noise is the time spent without the artificial arm, this was not emphasized as such. These results are not mentioned in the abstract, and the correlation for amputees is not shown in a figure.

      We thank the reviewer for their comment. While it is true that motor noise correlated with time of limbless experience in both groups, we were hesitant to highlight the result in amputees, considering the small number of participants and the lack of converging evidence (e.g., unlike in the congenital group, we did not find a strong main effect). For these reasons, we chose to include it in the manuscript without highlighting it or basing our main conclusions on it. Following the reviewer’s comment, the correlation for the amputees’ data is now visualised in Figure 3. Moreover, while the behavioural correlation might be similar in both groups, from a neural standpoint the limbless experience of a toddler with a developing brain is qualitatively different from that of an adult with a fully developed brain who has lost a limb. We were therefore hesitant to link these two findings within a single framework; however, in the revised manuscript we now highlight this tentative link.

      Discussion (4th paragraph):

      “In both the congenital and acquired groups, artificial arm reaching motor noise correlated with the amount of time they spent using only their residual limb. It is therefore tempting to link these two results under a unifying interpretation; however, this requires further research, considering the neural differences between the two groups.”

      Figure 3. Years of limbless experience before first artificial arm use in the acquired group. Relationship between years of limbless experience and (A) artificial arm reaching errors or (B) artificial arm motor noise in the acquired group.

      • The suggested mechanism of a deficit in visuomotor integration is not clear, nor is it clear whether the results indeed point to this hypothesis. The results of the reaching task show that the one-handers exhibit higher motor noise and larger initial directional errors than amputees. The results of the 2D localization task (the same as the standard reaching task but without visual feedback) show no difference in errors between the groups. First, it is not clear how the findings of the 2D localization task are in line with the results showing that one-handers make larger initial directional errors.

      We fully accept the reviewer’s comment regarding our vague use of the term visuomotor integration. In the revised manuscript, we have instead opted for a much broader term, suggesting a deficit in visually based corrective movements, given that we are limited in our ability to infer the specific underlying mechanism from our results. We have also made changes to the abstract based on the reviewer’s comment (see below).

      With regard to how the various results fit together, these are now discussed at greater length in the revised manuscript. In short, in the 2D localisation task (reaching without visual feedback), participants were not instructed to perform fast ballistic movements. Instead, they were told that they could make movements to correct for their initial aiming error (using proprioception). Together with the similar performance observed in the proprioceptive task, this strengthens our suggestion that the deficit in the congenital group is related to visually driven corrections. These considerations are now detailed as follows:

      Abstract:

      “Since we found no group differences when reaching without visual feedback, we suggest that the ability to perform efficient visually-based corrective movements is highly dependent on either biological or artificial arm experience at a very young age.”

      Results (section 7, 1st paragraph):

      “From these results, we infer that early-life experience relates to a suboptimal ability to reduce the system’s inherent noise, and that this is possibly not related to the noise generated by the execution of the initial motor plan. Early life experience might therefore relate to better use of visual feedback in performing corrective movements. The continuous integration of visual and sensory input is at the heart of visually-driven corrective movements. Therefore, one possibility is that limited early life experience results in suboptimal integration of information within the sensorimotor system.”

      Discussion (2nd paragraph):

      “When performing reaching movements without visual feedback (2D localisation task), the congenital group did not differ from the acquired or control group. This raises the question: if the congenital group has a deficit in motor planning, why was it not evident in this task as well? In the 2D localisation task, unlike the main task, participants were allowed to make corrective movements. While they did not receive visual feedback, the proprioceptive and somatosensory feedback from the residual limb appears to be enough to allow them to correct for initial reaching errors and perform at the same level as the acquired and control group. Moreover, we did not find strong evidence for an impaired sense of localisation of either the residual or the artificial arm in the congenital group. As such, by elimination, our evidence suggests that the process of using visual information to perform corrective movements isn’t as efficient in the congenital group.”

      Discussion (2nd paragraph):

      “Lack of concurrent visual and motor experience during development might therefore cause a deficit in the ability to form the computational substrates and thus to efficiently use visual information in performing corrective movements.”

      Discussion (last paragraph):

      “By the process of elimination, we have nominated suboptimal visual feedback-based corrections to be the most likely cause underlying this motor deficit.”

      Second, I think that these results suggest that the deficiency in one-handers lies in feedback responses rather than feedforward control. This may also be supported by the correlation with age: an early age is correlated with less end-point motor noise, rather than with smaller initial directional errors. Analyses of feedback corrections might help shed more light on the mechanism. The authors mention that participants were asked to avoid making corrective movements and that a limit of 1 sec per reach was imposed to encourage this. But it is not clear whether participants actually followed these instructions. 1 sec could be enough time to allow feedback responses, especially for small-amplitude movements (e.g., <10 cm).

      Please see below our response to the feedback correction analysis suggestion. Regarding corrective movements, we had the same concern as the reviewer which led us to use hand velocity data to identify first movement termination. We apologise if the experimental design and pre-processing procedures were not clear.

      In short, a 1 sec trial duration was imposed on all trials to create a sense of time pressure and encourage participants to perform fast ballistic movements. As we were concerned that participants might still perform secondary corrective movements within this 1 sec window, we used the hand velocity profile of each trial to identify the end of the first movement. Below, we have plotted the arm velocity from a single trial to illustrate this procedure. For this trial, the timepoint indicated by the circular marker was identified as the end of the first movement (see Methods for further information). For each trial, the endpoint location was defined as the location of the arm at the movement termination timepoint derived from the kinematic data, not at the 1 sec timepoint. Notably, performing the same analysis using the endpoints recorded at the 1 sec timepoint did not change the statistical results.

      This has now been further clarified in the text.

      Results (section 1, 1st paragraph):

      “Reaching performance was evaluated by measuring the mean absolute error participants made across all targets (see Figure 1C). The absolute error refers to the distance from the cursor’s position at the end of the first reach (endpoint) to the centre of the target in each trial. The endpoint of each trial was set as the arm location at the end of the first reaching movement, identified using the trial’s kinematic data (See Methods).”

      Methods (section: Data processing and analysis – main task):

      “Within the 1 sec movement time constraint, in some trials, participants still performed secondary corrective movements. We therefore used the tangential arm velocities to identify the end of the first reach in each trial (i.e., movement termination).”
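      The termination rule is described here only in outline, and the full criterion lives in the authors' Methods, which this excerpt does not reproduce. A minimal sketch, assuming 2D arm positions sampled at a fixed rate and a hypothetical criterion (the first sample after peak tangential speed at which speed falls below 10% of that peak), could look like this:

```python
import numpy as np


def first_movement_end(x, y, fs, threshold_frac=0.1):
    """Index of the end of the first reach from 2D positions sampled at fs Hz.

    Hypothetical criterion: the first sample after peak tangential speed at
    which speed drops below threshold_frac * peak speed.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dt = 1.0 / fs
    # Tangential speed from finite-difference velocities.
    speed = np.hypot(np.gradient(x, dt), np.gradient(y, dt))
    peak = int(np.argmax(speed))
    below = np.flatnonzero(speed[peak:] < threshold_frac * speed[peak])
    if below.size == 0:
        # Speed never dropped below threshold within the trial window.
        return speed.size - 1
    return peak + int(below[0])


def first_movement_endpoint(x, y, fs, threshold_frac=0.1):
    """(x, y) position at the end of the first reach, not at the trial timeout."""
    idx = first_movement_end(x, y, fs, threshold_frac)
    return float(x[idx]), float(y[idx])
```

      Because the search stops at the first sub-threshold sample after the initial velocity peak, a secondary correction later in the 1 sec window does not move the returned endpoint, as long as the first reach is faster than the correction.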

      Reviewer #2:

      This is a broad and ambitious study that is fairly unique in scope. The questions it seeks to answer are difficult to address scientifically, and yet the depth of these questions and the framework in which the study is founded seem out of place in a clinical journal.

      And yet, as a scientist and clinician, I found myself objecting to the claims of the authors, only to have them address my objection in the very next section. The results are surprising but compelling: the authors have done an excellent job of untangling a very complicated question, and they have tested (for our field) a large number of subjects.

      The main two results of the paper, from my perspective, are as follows:

      1) Persons with an amputation can form better models of new environments, such as manipulandums, than can those with congenital deficiencies. This result is interesting because a) the task did not depend on significant use of the device (they were able to use their intact musculature for the reaching-based task), and b) the results were not influenced by the devices used by the subjects (cosmetic, body-powered, or myoelectric).

      2) Persons with congenital deficiency fit earlier in life had less error than those fit later in life.

      Taken together, these results suggest that during early childhood the brain is better able to build the foundation needed to develop internal models, and that if it is deprived of this early in childhood, the foundation cannot be regained later in life, even with MORE experience (e.g., those with congenital deficiencies had more experience using their prosthetic arm than those with amputation, and yet scored worse).

      The questions analyzed by the researchers are excellent and the statistical methods are generally appropriate. My only minor concern is that the authors occasionally infer that two groups are the same when a large p-value is reported, whereas large p-values do not convey that the groups are the same; only that they cannot be proven to be different. The authors would need to use a technique such as ICC or analysis of similarities to prove the groups are the same.

      We appreciate the reviewer’s concern about inferring the null from classical frequentist statistics. In this manuscript, we have opted to use Bayesian statistics to test for similarity across groups (see Methods: Statistical analysis), rather than the frequentist methods suggested by the reviewer. This approach serves the same purpose as those proposed by the reviewer and is widely used in our field. A Bayes Factor (BF) smaller than 0.33 is regarded as sufficient evidence supporting the null hypothesis, that is, that there are no differences between the groups.

      This approach is described in detail in the methods and is introduced in the first section of the results as well.

      Results (1st section 2nd paragraph):

      “To further explore the non-significant performance difference between amputees and controls, we used a Bayesian approach (Rouder et al., 2009), that allows for testing of similarities between groups (the null hypothesis). In this analysis, the smaller effect size of the two reported here (1.39) was inputted as the Cauchy prior width. The resulting Bayesian Factor (BF10=0.28) provided moderate support to the null hypothesis (i.e., smaller than 0.33).”

      Methods (Statistical analysis section):

      “In parametric analyses (ANCOVA, ANOVA, Pearson correlations), where the frequentist approach yielded a non-significant p-value, a parallel Bayesian approach was used and Bayes Factors (BF) were reported (Morey & Rouder, 2015; Rouder et al., 2009, 2012, 2016). A BF < 0.33 is interpreted as support for the null hypothesis, and a BF > 3 is interpreted as support for the alternative hypothesis (Dienes, 2014). In Bayesian ANOVAs and ANCOVAs, the inclusion Bayes Factor of an effect (BFIncl) is reported, reflecting that the data are X (BF) times more likely under the models that include the effect than under the models without this predictor. When using a Bayesian t-test, a Cauchy prior width of 1.39 was used, based on the effect size of the main task when comparing artificial arm reaches of amputees and one-handers. Therefore, the null hypothesis in these cases would be that there is no effect as large as the effect observed in the main task.”
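      To show what a Cauchy prior width of 1.39 does, the default (JZS) Bayes factor for an independent-samples t-test described by Rouder et al. (2009) can be computed by one-dimensional numerical integration, as sketched below. This illustrates the published formula and is not the authors' code: the software they used is not named in this excerpt, and the function names are hypothetical.

```python
import numpy as np
from scipy import integrate


def jzs_bf10(t, n1, n2, r=1.39):
    """JZS Bayes factor BF10 for an independent-samples t statistic.

    Effect size delta has a Cauchy(0, r) prior, written as delta | g ~ N(0, g r^2)
    with g ~ Inverse-Gamma(1/2, 1/2); g is integrated out numerically.
    """
    n_eff = n1 * n2 / (n1 + n2)  # effective sample size
    nu = n1 + n2 - 2             # degrees of freedom

    def integrand(g):
        a = 1.0 + n_eff * g * r ** 2
        # Work in log space so very small g does not produce inf * 0.
        log_f = (-0.5 * np.log(a)
                 - 0.5 * (nu + 1) * np.log1p(t ** 2 / (a * nu))
                 - 0.5 * np.log(2.0 * np.pi)
                 - 1.5 * np.log(g)
                 - 1.0 / (2.0 * g))
        return np.exp(log_f)

    marginal_h1, _ = integrate.quad(integrand, 0.0, np.inf)
    marginal_h0 = (1.0 + t ** 2 / nu) ** (-0.5 * (nu + 1))
    return marginal_h1 / marginal_h0


def interpret(bf10, lower=0.33, upper=3.0):
    """Map BF10 onto the thresholds stated in the Methods text."""
    if bf10 < lower:
        return "support for H0"
    if bf10 > upper:
        return "support for H1"
    return "inconclusive"
```

      A wider prior (larger r) spreads the alternative's mass over larger effects, so for a small observed t it lowers BF10 and makes support for the null easier to reach. Anchoring r to the main-task effect size is what makes the null read as "no effect as large as the main-task effect".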

      Following the reviewer’s comment, we have carefully scanned through the manuscript to make sure that no equivalence claims are made without the support of a BF below 0.33. We found one such instance, which has now been rectified.

      Results (3rd section, 2nd paragraph):

      “We compared artificial arm and nondominant arm biases (distance from the centre of the endpoint to the target) across groups, using intact arm biases as a covariate. The ANCOVA resulted in no significant (inconclusive) group differences (F(2,47)=2.40, p=0.1, BFIncl=0.72; see Figure 2A).”

    1. Author Response

      Reviewer #1 (Public Review):

      Redox signaling is a dynamic and concerted orchestra of interconnected cellular pathways. There is an ongoing debate as to whether ROS (reactive oxygen species) are friend or foe. Continued research is needed to dissect how ROS generation and progression diverge in physiological versus pathophysiological states. Similarly, there are several paradoxical studies (both animal and human) in which the health benefits of exercise were reported to be accompanied by increases in ROS generation. It is in this context that the present manuscript deserves attention.

      Using in vitro studies as well as work in a mouse model, this manuscript illustrates the different regulatory mechanisms of exercise and antioxidant intervention on redox balance and blood glucose levels in diabetes. The manuscript has some limitations and might need additional experiments and explanation.

      The authors should consider addressing the following comments with additional experiments.

      1) Although hepatic AMPK activation appears to be a central signaling element for the benefits of moderate exercise on glucose control, additional hepatic signals related to gluconeogenesis, such as Forkhead box O1 (FoxO1), phosphoenolpyruvate carboxykinase (PEPCK), and GLUT2, need to be profiled to present a holistic picture. The authors should consider this and revise the manuscript.

      We appreciate the constructive suggestion. Besides glycolysis, gluconeogenesis and glucose uptake are critical in maintaining liver and blood glucose homeostasis.

      FoxO1 has been tightly linked with hepatic gluconeogenesis through promoting the transcription of the gluconeogenic genes PEPCK and G6Pase (1, 2). Herein, we found that FoxO1 expression increased in the diabetic group but was reduced in the CE, IE and EE groups (Fig. X1A, Fig. 5E-F in manuscript). Meanwhile, the mRNA levels of Pepck and G6PC (one of the three genes encoding the G6Pase catalytic subunit) also decreased in the CE, IE and EE groups (Fig. X1B-1C, Fig. 5H-I in manuscript). These results suggest that all three modes of exercise inhibited gluconeogenesis through down-regulation of FoxO1.

      For glucose uptake, we measured the protein expression of GLUT2 in liver tissue. GLUT2 helps mediate glucose uptake by hepatocytes for glycolysis and glycogenesis. We found that GLUT2, a glucose sensor in the liver, was up-regulated in diabetic rats but down-regulated by the CE and IE interventions. However, GLUT2 did not decrease in the EE group, which is consistent with the lack of improvement in blood glucose after EE intervention (Figure X1A, Fig. 5E and 5G in manuscript).

      Taken together, moderate exercise could benefit glucose control by increasing glycolysis and decreasing gluconeogenesis. We have added this part on page 9, lines 251-263, and in Figure 5E-5I of this version.

      Figure X1. A. Representative protein levels and quantitative analysis of FOXO1 (82 kDa), GLUT2 (60-70 kDa) and Actin (45 kDa) in rats in the Ctl, T2D, T2D + CE, T2D + IE and T2D + EE groups. B-C. Expression of hepatic Pepck and G6PC mRNA in the Ctl, T2D, T2D + CE, T2D + IE and T2D + EE groups, evaluated by real-time PCR analysis. Values represent mean ratios of Pepck and G6PC transcripts normalized to GAPDH transcript levels.
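
      The legend states that Pepck and G6PC transcripts were normalized to GAPDH, but the response does not say which quantification model was used. A minimal sketch of one common choice, the 2^-ΔΔCt method, is shown below; it assumes roughly 100% amplification efficiency for target and reference genes. The function name and all Ct values are hypothetical and are not data from this study.

```python
from statistics import mean


def relative_expression(target_ct, ref_ct, ctl_target_ct, ctl_ref_ct):
    """Fold change of a target gene versus the control group (2**-ddCt method).

    Each argument is a list of technical-replicate Ct values. Assumes ~100%
    amplification efficiency for both the target and the reference (e.g. GAPDH).
    """
    d_ct_sample = mean(target_ct) - mean(ref_ct)           # normalize to reference gene
    d_ct_control = mean(ctl_target_ct) - mean(ctl_ref_ct)  # same for the Ctl group
    dd_ct = d_ct_sample - d_ct_control
    return 2 ** (-dd_ct)


# Hypothetical Ct values for illustration only
ctl = {"Pepck": [24.0, 24.1, 23.9], "GAPDH": [18.0, 18.1, 17.9]}
t2d = {"Pepck": [23.0, 23.1, 22.9], "GAPDH": [18.0, 18.0, 18.0]}

fold = relative_expression(t2d["Pepck"], t2d["GAPDH"], ctl["Pepck"], ctl["GAPDH"])
print(f"Pepck fold change, T2D vs Ctl: {fold:.2f}")  # about 2-fold higher in this toy data
```

      A lower ΔCt relative to the control means more transcript, hence the negative exponent; if efficiencies differ markedly from 100%, an efficiency-corrected model would be more appropriate.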

      2) Very recently, sestrin2 signaling has attracted significant attention in relation to exercise and antioxidant responses. Therefore, the authors should profile sestrin2 levels, as it is linked to several targets such as mTOR, AMPK and Sirt1. Additionally, Nrf2 levels should be reported, as Nrf2 is the central regulator of the threshold mechanisms of oxidative stress and ROS generation.

      We appreciate the reviewer’s expert comments. Nrf2 is an important mediator of antioxidant signaling, playing a fundamental role in maintaining the redox homeostasis of the cell. Under unstressed conditions, Nrf2 activity is suppressed by its innate repressor Kelch-like ECH-associated protein 1 (Keap1) (3). As ROS levels increase during the development of diabetes, Nrf2 is activated to induce the transcription of several antioxidant enzymes (4, 5).

      Nrf2 expression has been reported to increase in HFD-fed mice and in diabetic patients (6, 7). In vitro studies have found that Nrf2 activation is achieved with acute exposure to high glucose, whereas longer incubation times or oscillating glucose concentrations fail to activate Nrf2 (8, 9). These findings suggest that the increase in ROS in diabetes can cause a compensatory upregulation of Nrf2. In our study, we found that Nrf2 increased in diabetic rats, which may further initiate the expression of antioxidant enzymes. As shown in Fig. X2A (Fig. 2H-2K in the manuscript), Grx and Trx, which are involved in the glutaredoxin and thioredoxin systems, were up-regulated in parallel with Nrf2. After CE intervention, Nrf2 levels increased further (Fig. 2E-2F), suggesting that CE intervention could activate the antioxidant system to achieve a high-level redox balance. We have added these new results to Figure 2.

      On the other hand, the expression levels of Sestrin2 and Nrf2 decreased after antioxidant supplementation. Our results suggest that the antioxidant treatment ameliorated diabetes by lowering ROS levels to achieve a low-level redox balance, whereas moderate exercise enhanced ROS tolerance to achieve a high-level balance (Fig. X2D-F; Fig. 3E-3G in the manuscript).

      We have added the new data on page 5, lines 147-153, and page 7, lines 183-186, and in Figures 2-3 of the current version.

      Figure X2. A-C. Representative protein levels and quantitative analysis of Nrf2 (97 kDa), Sestrin2 (57 kDa) and Actin (45 kDa) in rats in the Ctl, T2D and T2D + CE groups. D-F. Representative protein levels and quantitative analysis of Nrf2 (97 kDa), Sestrin2 (57 kDa) and HSP90 (90 kDa) in rats in the Ctl, T2D and T2D + APO groups.

      3) Authors should discuss the exercise-associated hormesis curve. They should discuss whether moderate exercise could decrease the sensitivity to oxidative stress by altering the bell-shaped dose-response curve.

      We thank the reviewer for these valuable comments. Zsolt Radak et al. proposed a bell-shaped dose-response curve relating physiological function to ROS levels in healthy individuals, and suggested that moderate exercise can extend, or stretch, the range of tolerated ROS levels while increasing physiological function (10). Our results support this hypothesis and further suggest that, in line with the bell-shaped curve, moderate exercise produces ROS while increasing antioxidant enzyme activity to maintain a high-level redox balance, whereas excessive exercise generates higher ROS levels, leading to reduced physiological function. In this study, we found that the state of diabetic individuals is better described by an S-shaped curve, owing to their high level of oxidative stress and decreased reducing capacity (Fig. 8B). As ROS levels increase, the physiological function of diabetic individuals gradually decreases and they enter a state of redox imbalance. Moderate exercise shifts the S-shaped curve into a bell-shaped dose-response curve, thereby reducing the sensitivity of diabetic individuals to oxidative stress and restoring redox homeostasis. However, with excessive exercise, ROS production exceeds the threshold range of redox balance, resulting in decreased physiological function (Fig. 8B, see the descending portion of the bell curve to the right of the apex).

      By contrast, the antioxidant intervention increased physiological activity by reducing ROS levels in diabetic individuals, restoring a bell-shaped dose-response curve at low ROS levels (Fig. 8B). Thus, redox balance could be achieved either at a low ROS level mediated by antioxidant intervention or at a high ROS level mediated by moderate exercise, both of which were regulated by AMPK activation. Both high- and low-level redox balance can therefore lead to high physiological function, as long as they fall within the threshold range of redox balance. AMPK activation may thus be an important marker of exercise or antioxidant interventions achieving a dynamic redox balance that helps restore physiological function. Accordingly, we speculate that antioxidant intervention on top of moderate exercise might offset the effects of exercise, whereas antioxidants could be beneficial during excessive exercise. A human study also supports the notion that antioxidant supplementation may preclude the health-promoting effects of exercise (11). Therefore, personalized interventions that take redox balance into account will be crucial for the effective treatment of patients with diabetes.

      We have added this part to the Discussion in this version (pages 13-14, lines 389-418).

      4) It would not be ideal to single out AMPK as the sole biomarker in this manuscript. Instead, the authors should consider AMPK activation and associated signaling in relation to redox balance. This should also be presented in Fig 7.

      We thank the reviewer for these critical comments. Accordingly, we have discussed AMPK signaling in the Discussion (page 13, lines 373-384) and added the AMPK signaling pathway to Fig. 8A.

      References:

      1. R. A. Haeusler, K. H. Kaestner, D. Accili, FoxOs function synergistically to promote glucose production. J Biol Chem 285, 35245-35248 (2010).
      2. J. Nakae, T. Kitamura, D. L. Silver, D. Accili, The forkhead transcription factor Foxo1 (Fkhr) confers insulin sensitivity onto glucose-6-phosphatase expression. J Clin Invest 108, 1359-1367 (2001).
      3. M. McMahon, K. Itoh, M. Yamamoto, J. D. Hayes, Keap1-dependent proteasomal degradation of transcription factor Nrf2 contributes to the negative regulation of antioxidant response element-driven gene expression. J Biol Chem 278, 21592-21600 (2003).
      4. R. S. Arnold et al., Hydrogen peroxide mediates the cell growth and transformation caused by the mitogenic oxidase Nox1. Proc Natl Acad Sci U S A 98, 5550-5555 (2001).
      5. J. M. Lee, M. J. Calkins, K. Chan, Y. W. Kan, J. A. Johnson, Identification of the NF-E2-related factor-2-dependent genes conferring protection against oxidative stress in primary cortical astrocytes using oligonucleotide microarray analysis. J Biol Chem 278, 12029-12038 (2003).
      6. T. Jiang et al., The protective role of Nrf2 in streptozotocin-induced diabetic nephropathy. Diabetes 59, 850-860 (2010).
      7. X. H. Wang et al., High Fat Diet-Induced Hepatic 18-Carbon Fatty Acids Accumulation Up-Regulates CYP2A5/CYP2A6 via NF-E2-Related Factor 2. Front Pharmacol 8, 233 (2017).
      8. T. S. Liu et al., Oscillating high glucose enhances oxidative stress and apoptosis in human coronary artery endothelial cells. J Endocrinol Invest 37, 645-651 (2014).
      9. Z. Ungvari et al., Adaptive induction of NF-E2-related factor-2-driven antioxidant genes in endothelial cells in response to hyperglycemia. Am J Physiol Heart Circ Physiol 300, H1133-1140 (2011).
      10. Z. Radak et al., Exercise, oxidants, and antioxidants change the shape of the bell-shaped hormesis curve. Redox Biol 12, 285-290 (2017).
      11. M. Ristow et al., Antioxidants prevent health-promoting effects of physical exercise in humans. Proc Natl Acad Sci U S A 106, 8665-8670 (2009).
    1. Author Response

      Reviewer #1 (Public Review):

      In one of the most creative eDNA studies I have had the pleasure to review, the authors have taken advantage of an existing program several decades old to address whether insect declines are indeed occurring, an active area of discussion and debate within ecology. Here, they extracted arthropod environmental DNA (eDNA) from pulverized leaf samples collected from different tree species across different habitats. Their aim was to assess the arthropod community composition within the canopies of these trees at the time of collection to determine whether arthropod richness, diversity, and biomass were declining. By utilizing these leaf samples, the greatest shortcoming of assessing arthropod declines (the lack of historical data for comparison) was overcome, and strong time-series evidence can now be used to inform the discussion. Through their use of eDNA metabarcoding, they were able to determine that richness was not declining, but there was evidence of beta diversity loss due to biotic homogenization across different habitats. Furthermore, their application of qPCR to assess temporal changes in eDNA copy number and associate those changes with changes in arthropod biomass provided support for the argument that arthropod biomass is indeed declining. Taken together, these data add substantial weight to the current discussion regarding how arthropods are being affected in the Anthropocene.

      Thank you very much for the positive assessment of our work.

      I find the conclusions of the paper to be sound and mostly defensible, though there are some issues to take note of that may undermine these findings.

      Firstly, I saw no explanation of the requisite controls for such an experiment. An experiment of this scale should have detailed explanations of the field/equipment controls, extraction controls, and PCR controls to ensure there are no contamination issues that would otherwise undermine the entirety of the study. Controls are mentioned only once in the manuscript, so I surmise they exist, but the results should be treated with caution until this evidence is clearly outlined. Furthermore, a plate layout that includes these controls would help assess the extent of tag-jumping, should the plate plan proposed in Taberlet et al., 2018 be adopted.

      Second, without the presence of adequate controls, filtering schemes would be unable to determine whether there were contaminants and also be unable to remove them. This would also prevent samples from being filtered out should there be excessive levels of contamination present. Without such information, it makes it difficult to fully trust the data as presented.

      Finally, there is insufficient detail regarding the decontamination procedures for equipment used to prepare the samples (e.g., the cryomill). Without clear explanations of the steps the authors took to ensure samples were handled and prepared correctly, there is yet more concern that there may be unseen problems with the dataset.

      We are well aware of the potential issues and consequences of contamination in our work. However, we are also confident that our field and laboratory procedures adequately rule out these issues. We agree with the reviewer that we should expand more on our reasoning. Hence, we have now significantly expanded the Methods section outlining controls and sample purity, particularly under “Tree samples of the German Environmental Specimen Bank – Standardized time series samples stored at ultra-low temperatures” (lines 303-304), “Test for DNA carryover in the cryomill” (lines 448-464) and “Statistical analysis” (lines 570-575).

      We ran negative control extractions as well as negative control PCRs with all samples. These controls were sequenced along with all samples and used to explore the effect of experimental contamination. With the exception of a few reads of abundant taxa, these controls were mostly clean. We now report this in more detail in the Methods under “Sequence analysis” (lines 570-575). This suggests that our data are largely free of experimental contamination and tag-jumping issues.

      We have also expanded on the avoidance of contamination in our field sampling protocols. The ESB has been set up for monitoring even the tiniest trace amounts of chemicals, and carryover between samples would render the samples useless. Hence, highly clean and standardized protocols are implemented. All samples are collected only with sterilized equipment under sterile conditions, and each piece of equipment is thoroughly decontaminated before sampling.

      The cryomill is another potential source of cross-contamination. The mill is disassembled and thoroughly cleaned after each sample. Milled samples have already been tested for chemical carryover, and none was found. We have now added an additional analysis to rule out DNA carryover. We obtained the milling schedule of samples for the past years. If carryover between milling runs occurred, consecutively milled samples should share signatures of it. We tested this for single-taxon carryover as well as for community-wide beta diversity, but did not find any signal of contamination. This gives us confidence that our samples are very pure. The results of this test are now reported in the manuscript (Suppl. Fig 12 & Suppl. Table 3).
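
      The community-wide carryover check described above can be illustrated with a permutation test. The sketch below is an assumption about how such a test might look, not the authors' actual analysis: it computes the mean Bray-Curtis dissimilarity between samples milled one after another and compares it with the same statistic after randomly shuffling the milling order. If carryover occurred, consecutive samples should be more similar than shuffled neighbours, giving a small p-value. All function names and the toy data are hypothetical.

```python
import random


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors (0 = identical)."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(x + y for x, y in zip(a, b))
    return num / den if den else 0.0


def mean_consecutive_bc(samples):
    """Mean dissimilarity between each sample and the one milled right after it."""
    d = [bray_curtis(a, b) for a, b in zip(samples, samples[1:])]
    return sum(d) / len(d)


def carryover_test(samples, n_perm=999, seed=1):
    """One-sided permutation test: are consecutively milled samples more
    similar than expected if the milling order were random?"""
    rng = random.Random(seed)
    observed = mean_consecutive_bc(samples)
    hits = 0
    for _ in range(n_perm):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        if mean_consecutive_bc(shuffled) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy example with strong simulated carryover: each community is milled twice in a row
communities = [[10 if t == i else 0 for t in range(10)] for i in range(10)]
milling_order = [c for c in communities for _ in range(2)]
obs, p = carryover_test(milling_order)
print(f"mean consecutive Bray-Curtis = {obs:.3f}, permutation p = {p:.3f}")
```

      In real data without carryover, the observed statistic should fall well inside the shuffled distribution, so the p-value would not be small.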

      Reviewer #2 (Public Review):

      Krehenwinkel et al. investigated the long-term temporal dynamics of arthropod communities using environmental DNA (eDNA) remaining in archived leaf samples. The authors first developed a method to recover arthropod eDNA from archived leaf samples and carefully tested whether the developed method could reasonably reveal the dynamics of the arthropod communities at the sites where the leaf samples originated. Then, using the eDNA method, the authors analyzed 30-year-long well-archived tree leaf samples in Germany and reconstructed the long-term temporal dynamics of arthropod communities associated with the tree species. The reconstructed time series includes several thousand arthropod species belonging to 23 orders, and the authors found interesting patterns in the time series. Contrary to some previous studies, the authors did not find widespread temporal α-diversity (OTU richness and haplotype diversity) declines. Instead, β-diversity among study sites gradually decreased, suggesting that the arthropod communities have become more spatially homogenized in recent years. Overall, the authors suggested that the temporal dynamics of arthropod communities may be complex and involve changes in α- and β-diversity, and demonstrated the usefulness of their unique eDNA-based approach.

      Strengths:

      The authors' idea of using eDNA remaining in archived leaf samples is unique and potentially applicable to other systems. For example, different types of specimens archived in museums may be utilized for reconstructing long-term community dynamics of other organisms, which would be beneficial for understanding and predicting ecosystem dynamics.

      A great strength of this work is that the authors very carefully tested their method. For example, the authors tested the effects of the input weight of powdered leaves, sampling methods, storage methods, PCR primers, and days from last precipitation to sampling on the eDNA metabarcoding results. The results showed that the tested variables did not significantly impact the eDNA metabarcoding results, which convinced me that the proposed method reasonably recovers arthropod eDNA from the archived leaf samples. Furthermore, the authors developed a method that can separately quantify 18S DNA copy numbers of arthropods and plants, which enables estimation of relative arthropod eDNA copy numbers. While most eDNA studies provide relative abundance only, the DNA copy numbers measured in this study provide valuable information on arthropod community dynamics.

      Overall, the authors' idea is excellent, and I believe that the developed eDNA methodology reasonably reconstructed the long-term temporal dynamics of the target organisms, which are major strengths of this study.

      Thank you very much for the positive assessment of our work.

      Weaknesses:

      Although this work has major strengths in the eDNA experimental part, there are concerns in DNA sequence processing and statistical analyses.

      Statistical methods to analyze the temporal trend are too simplistic. The methods used in the study did not consider possible autocorrelation and other structures that the eDNA time series might have. It is well known that the applications of simple linear models to time series with autocorrelation structure incorrectly detect a "significant" temporal trend. For example, a linear model can often detect a significant trend even in a random walk time series.

      We have now reanalyzed our data controlling for autocorrelation and for non-linear changes in abundance, and our results are unchanged. We have added this information to the manuscript under “Statistical analysis” (lines 629-644).

      Also, there are some issues regarding the DNA sequence analysis and the subsequent use of the results. For example, read abundance was used in the statistical model, but read abundance cannot serve as a proxy for species abundance/biomass. Because the total 18S DNA copy numbers of arthropods were quantified in the study, multiplying the sequence-based relative abundance by the total 18S DNA copy numbers may produce a better proxy of arthropod abundance, and the use of such a proxy would be more appropriate here. In addition, coverage-based rarefaction enables a more rigorous comparison of diversity (OTU diversity or haplotype diversity) than read-based rarefaction does.

      We did not use read abundance as a proxy for abundance, but used our qPCR approach to measure the relative copy number of arthropods. While there are biases to this (see our explanations above), the assay proved very reliable and robust. We thus believe it should indeed provide a rough estimate of biomass. As biomass is very commonly discussed in the context of insect decline (in fact, the first study on insect decline relies entirely on biomass; Hallmann et al. 2017), we feel it is important to include a proxy for this as well. However, we also discuss the alternative possibility that a turnover of diversity is affecting the measured biomass. A pattern of abundance loss for common species has been described in other works on insect decline.

      We liked the reviewer’s suggestion to use copy number information to perform abundance-informed rarefaction. We have done this now and added an additional analysis rarefying by copy number/biomass. A parallel analysis using this newly rarefied table was done for the total diversity as well as single species abundance change. Details can be found in the Methods and Results section of the manuscript. However, the result essentially remains the same. Even abundance-informed rarefaction does not lead to a pattern of loss of species richness over time (see “Statistical analysis”).
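
      The response does not detail how copy number was combined with rarefaction, so the sketch below shows one possible scheme under stated assumptions. It implements the reviewer's proxy (relative read abundance multiplied by total arthropod copy number), and it rarefies each sample to a read depth proportional to its relative arthropod copy number, so that samples with less arthropod DNA retain fewer reads and can lose rare OTUs. Function names, the base depth and the toy tables are hypothetical.

```python
import random
from collections import Counter


def copy_scaled_abundance(reads, total_copies):
    """Reviewer's proxy: relative read abundance x total arthropod 18S copies."""
    n = sum(reads.values())
    return {otu: total_copies * c / n for otu, c in reads.items()}


def rarefy(reads, depth, rng):
    """Subsample `depth` reads without replacement from an OTU -> count table."""
    pool = [otu for otu, c in reads.items() for _ in range(c)]
    return Counter(rng.sample(pool, depth))


def copy_informed_rarefaction(samples, copies, base_depth, seed=0):
    """Rarefy each sample to a depth proportional to its relative copy number.

    The sample with the highest copy number is rarefied to `base_depth` reads;
    depths are capped at the number of reads a sample actually has.
    """
    rng = random.Random(seed)
    top = max(copies.values())
    out = {}
    for name, reads in samples.items():
        depth = min(int(base_depth * copies[name] / top), sum(reads.values()))
        out[name] = rarefy(reads, depth, rng)
    return out


# Hypothetical OTU tables and relative copy numbers for two sampling years
samples = {
    "1990": {"otu1": 500, "otu2": 300, "otu3": 200},
    "2020": {"otu1": 600, "otu2": 350, "otu3": 50},
}
copies = {"1990": 1.0, "2020": 0.5}
rarefied = copy_informed_rarefaction(samples, copies, base_depth=400)
print({year: len(table) for year, table in rarefied.items()})  # OTUs retained per year
```

      Scaling depth by copy number makes richness comparisons reflect differences in arthropod DNA quantity, which plain equal-depth rarefaction deliberately removes.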

      Overall, the results support a scenario of no loss of species richness over time, but a loss of abundance for common species. We indeed see this pattern of declining abundance for once-common species in our data, for example the loss of the Green Silver-Line moth, once a very common species in the beech canopy (Suppl. Fig. 10). We have added details on this to the Discussion (lines 254-260).

      These points may significantly impact the conclusions of this work.

      Reviewer #3 (Public Review):

      The aim of Weber and colleagues' study was to generate arthropod environmental DNA extracted from a unique 30-year time series of deep-frozen leaf material sampled at 24 German sites that represent four different land use types. Using this dataset, they explore how the arthropod community has changed through time at these sites, using both conventional metabarcoding to reconstruct the OTUs present and a new qPCR assay developed to estimate the overall arthropod biomass on the collected material. Overall, their results show that while no clear changes in alpha diversity are found, β-diversity dropped significantly over time at many sites, most notably in the beech forests. I believe their data support these findings, and thus their conclusion that diversity is becoming homogenized through time is valid.

      Thank you for the positive assessment.

      While overall I do not doubt the general findings, I have a number of comments. Firstly while I agree this is a very nice study on a unique dataset - other temporal datasets of insects that were used for eDNA studies do exist, and perhaps it would be relevant to put the findings into context (or even the study design) of other work that has been done on such datasets. One example that jumps to my mind is Thomsen et al. 2015 https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/1365-2656.12452 but I am sure there are others.

      We have expanded the introduction and discussion on this citing this among other studies now (lines 71-72, 276-278).

      From a technical point of view, the conclusions of course rely on several assumptions, including (1) that the biomass assay is effective and (2) that the reconstructed levels of OTU diversity are accurate.

      With regard to biomass, although the manuscript states that "Relative eDNA copy number should be a predictor for relative biomass", this is only true under a number of assumptions, e.g. a similar 18S rDNA copy number per species, similar numbers of mtDNA copies per cell, similar numbers of cells per individual across species, etc. On the positive side, it is gratifying to see that the authors performed a validation assay on 7 mock controls, and these seem to indicate that the assay works well. Given how critical this is, I recommend discussing the details a bit more in the main text, including why the authors are convinced the assay is effective, so that readers can decide whether they agree. On the negative side, I am concerned that the strategy used for the qPCR may not have been ideal. Specifically, the assay is based on nested PCR: the authors first perform a 15-cycle amplification, purify the product, and then use it in a subsequent qPCR. Given that PCR is notorious for introducing amplification biases (especially when performed on low amounts of DNA), and that nested PCRs are notoriously contamination-prone, this approach seems to be asking for trouble. This raises the question of why the qPCR was not simply performed directly on the extracts (the plant DNA could still be diluted 100x prior to qPCR if needed). Further, given that the qPCRs were run in triplicate, I think the full data (Ct values) should be released, rather than just stating in the paper that the average values were used. This would allow readers to judge how replicable the assay was, which I think is critical given how noisy the patterns in Fig S10 seem to be.

      We agree with this point, and this is why we do not want to overstate the decline in copy number. This is an additional source of data next to genetic and species diversity. We have added to our discussion of turnover as another potential driver of copy number change (lines 257-260). We have also added text addressing the robustness of the mock community assay (lines 138-141).

      However, we are confident of the reliability and robustness of our qPCR assay for the detection of relative arthropod copy number. We performed several validations and optimizations before using the assay. We have added additional details on this to the manuscript (see “Detection of relative arthropod DNA copy number using quantitative PCR”, lines 548-556). We adopted the nested qPCR design from a study (Tran et al.) showing its high accuracy and reproducibility. Triplicates of each qPCR show that our assay is highly replicable; we will now include these data in the supplementary data on Dryad. The SD of Ct values is very low (~0.1 on average). No-template controls (NTCs) were run with all qPCRs to rule out contamination. The assay also has very high efficiency, and it remains accurate even at dilutions far outside the copy numbers observed in our actual leaf data. We found very comparable abundance changes across our highly taxonomically diverse mock communities. This also suggests that abundance changes are a more likely explanation than simple turnover for the observed drop in copy number. A biomass loss for common species is well in line with recent reports on insect decline. We can also rely on several other mock community studies (Krehenwinkel et al. 2017 & 2019) in which we used read abundance of 18S and found it to be a relatively good predictor of relative biomass.
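
      Summarizing triplicate Ct values and converting them into an arthropod-to-plant copy ratio can be sketched as follows. This is a simplified illustration, not the authors' quantification pipeline: it assumes both 18S assays amplify with the same efficiency (a doubling per cycle by default) and ignores the nested pre-amplification step. The function names and Ct values are hypothetical.

```python
from statistics import mean, stdev


def summarize_triplicate(cts):
    """Mean and sample standard deviation of technical-replicate Ct values."""
    return mean(cts), stdev(cts)


def relative_copy_number(ct_arthropod, ct_plant, efficiency=2.0):
    """Arthropod:plant 18S copy ratio from mean Ct values.

    Starting copies scale as efficiency ** (-Ct), so with equal efficiencies
    the ratio is efficiency ** (Ct_plant - Ct_arthropod).
    """
    return efficiency ** (ct_plant - ct_arthropod)


# Hypothetical triplicates for one leaf sample
arth_mean, arth_sd = summarize_triplicate([24.1, 24.2, 24.0])
plant_mean, plant_sd = summarize_triplicate([18.0, 18.1, 17.9])
ratio = relative_copy_number(arth_mean, plant_mean)
print(f"arthropod Ct {arth_mean:.2f} +/- {arth_sd:.2f}; arthropod:plant ratio {ratio:.4f}")
```

      Because each cycle corresponds to a roughly twofold difference, a replicate SD of about 0.1 cycles translates into only a few percent uncertainty in the estimated copy ratio.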

      The pattern in Fig. S10 is not really noisy. It just reflects typical population fluctuations for arthropods. Most arthropod taxa undergo very pronounced temporal abundance fluctuations between years.

      Next, with regards to the observation that the results reveal an overall decrease in arthropod biomass over time: The authors suggest one alternate to their theory, that the dropping DNA copy number may reflect taxonomic turnover of species with different eDNA shedding rates. Could there be another potential explanation - simply be that leaves are getting denser/larger? Can this be ruled out in some way, e.g. via data on leaf mass through time for these trees? (From this dataset or indeed any other place).

      This is a very good point. However, we can rule out this hypothesis, as the ESB performs intensive biometric data analysis. The average leaf weight and water content have not significantly changed in our sites. We have addressed this in the Methods section (see “Tree samples of the German Environmental Specimen Bank – Standardized time series samples stored at ultra-low temperatures”, lines 308-311).

      With regards to estimates of OTU/zOTU diversity. The authors state in the manuscript that zOTUs represent individual haplotypes, thus genetic variation within species. This is only true if they do not represent PCR and/or sequencing errors. Perhaps therefore they would be able to elaborate (for the non-computational/eDNA specialist reader) on why their sequence processing methods rule out this possibility? One very good bit of evidence would be that identical haplotypes for the individual species are found in the replicate PCRs. Or even between different extractions at single locations/timepoints.

      We have repeated the analysis of genetic variation with much more stringent filtering criteria (see “Statistical analysis”, lines 611-615). Among other filtering steps, this includes the use of only those zOTUs that occur in both technical replicates, as suggested by the reviewer. Another reason we believe we are dealing with true haplotypic variation is that haplotypes show geographic variation; e.g., some haplotypes are more abundant at some sites than at others. NUMTs, by contrast, would be expected to show abundances that simply track that of the most abundant true haplotype.
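
      The replicate filter described above amounts to a set intersection over the two PCR replicates. A minimal sketch is shown below; the minimum read count per replicate is a hypothetical parameter added for illustration, since the response does not state the exact thresholds used in the more stringent filtering.

```python
def shared_zotus(rep_a, rep_b, min_reads=2):
    """Keep zOTUs seen with at least `min_reads` reads in BOTH PCR replicates.

    rep_a, rep_b: dicts mapping zOTU ID -> read count in each technical replicate.
    Returns pooled read counts for the retained zOTUs only.
    """
    in_a = {z for z, c in rep_a.items() if c >= min_reads}
    in_b = {z for z, c in rep_b.items() if c >= min_reads}
    return {z: rep_a[z] + rep_b[z] for z in sorted(in_a & in_b)}


# Hypothetical read counts for two technical replicates of one sample
rep1 = {"zOTU1": 120, "zOTU2": 1, "zOTU3": 15}
rep2 = {"zOTU1": 98, "zOTU3": 0, "zOTU4": 40}
print(shared_zotus(rep1, rep2))  # {'zOTU1': 218}
```

      Random PCR or sequencing errors are unlikely to produce the same sequence independently in two replicates, which is why requiring detection in both removes many spurious haplotypes.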

      With regard to the bigger picture, one thing I found very interesting from a technical point of view is that the authors explored how modifying the mass of plant material used in the extraction affects the overall results, and basically found that using more than 200mg provides no real advantage. In this regard, I draw the authors' and readers' attention to an excellent paper by Mata et al. (https://onlinelibrary.wiley.com/doi/full/10.1111/mec.14779), in which the authors compare the effect of increasing the amount of bat faeces used in a bat diet metabarcoding study on the OTUs generated. Essentially, Mata and colleagues report that as the amount of faeces increases, the rare taxa (e.g. those found at a low level in a single faecal sample) get lost, as they are simply diluted out by the common taxa (e.g. those in all faeces). In contrast, increasing biological replicates (in their case, more individual faecal samples) increased diversity. I think these results are relevant in the context of the experiment described in this new manuscript, as they seem to show similar results: there is no benefit of considerably increasing the amount of leaf tissue used. If so, this seems to point to a general principle relevant to the design of metabarcoding studies, and thus of likely wide interest.

      Thank you for this interesting study, which we were not aware of before. The cryomilling is an extremely efficient approach to equally disperse even traces of chemicals in a sample. This has been established for trace chemicals early during the operation of the ESB, but also seems to hold true for eDNA in the samples. We have recently done more replication experiments from different ESB samples (different terrestrial and marine samples for different taxonomic groups) and find that replication of extraction does not provide much more benefit than replication of PCR. Even after 2 replicates, diversity approaches saturation. This can be seen in the plot below, which shows recovered eDNA diversity for different ESB samples and different taxonomic groups from 1-4 replicates. A single extract of a small volume contains DNA from nearly all taxa in the community. Rare taxa can be enriched with more PCR replicates.
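
      The saturation pattern described above (diversity approaching a plateau after about two replicates) can be quantified with a simple replicate-accumulation curve: the mean number of distinct taxa recovered when pooling every combination of k replicates. The sketch and its toy taxon sets are hypothetical and do not reproduce the authors' plot.

```python
from itertools import combinations


def replicate_accumulation(replicates):
    """Mean number of distinct taxa detected when pooling k of n replicates.

    replicates: list of sets of taxa, one per extraction or PCR replicate.
    Returns {k: mean richness over all k-combinations of replicates}.
    """
    n = len(replicates)
    curve = {}
    for k in range(1, n + 1):
        richness = [len(set().union(*combo)) for combo in combinations(replicates, k)]
        curve[k] = sum(richness) / len(richness)
    return curve


# Hypothetical taxa detected in four replicates of one ESB sample
reps = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "b", "c"}, {"a", "c", "d"}]
print(replicate_accumulation(reps))
```

      A curve that flattens after k = 2, as in this toy example, would indicate that further replicates add few new taxa.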

    1. Author response

      Reviewer #1 (Public Review):

      This careful study reports the importance of Rab12 for Parkinson's disease associated LRRK2 kinase activity in cells. The authors carried out a targeted siRNA screen of Rab substrates and found lower pRab10 levels in cells depleted of Rab12. It has previously been reported that LLOMe treatment of cells breaks lysosomes and with time, leads to major activation of LRRK2 kinase. Here they show that LLOMe-induced kinase activation requires Rab12 and does not require Rab12 phosphorylation to show the effect.

      We thank the reviewer for their comments regarding the carefulness and importance of our work and for their specific feedback which has substantially improved our revised manuscript.

      1) Throughout the text, the authors claim that "Rab12 is required for LRRK2 dependent phosphorylation" (Page 4 line 78; Page 9 line 153; Page 22 line 421). This is not correct according to Figure 1 Figure Supp 1B - there is still pRab10. It is correct only in relation to the LLOMe activation. Please correct this error.

      We appreciate the reviewer’s comment around the requirement of Rab12 for LRRK2-dependent phosphorylation of Rab10 and question regarding whether this is relevant under baseline conditions or only in relation to LLOMe activation. Using our MSD-based assay to quantify pT73 Rab10 levels under basal conditions, we observed a similar reduction in Rab10 phosphorylation when we knockdown Rab12 as we also observed with LRRK2 knockdown (Figure 1A). Further, we see comparable reduction in Rab10 phosphorylation in RAB12 KO cells as that observed in LRRK2 KO cells using our MSD-based assay (Figure 2A and B). Based on this data, we believe Rab12 is a key regulator of LRRK2 activation under basal conditions without additional lysosomal damage. However, as the reviewer noted, we do observe some residual Rab10 phosphorylation upon Rab12 knockdown when assessed by western blot analysis (Figure 1D and Figure 1- figure supplement 1). A similar signal is observed upon LRRK2 knockdown, which may suggest that some small amount of Rab10 phosphorylation may be mediated by another kinase in this cell model. Nevertheless, we appreciate this reviewer’s point and have therefore modified the text to remove any reference to Rab12 being required for LRRK2-dependent Rab phosphorylation and now instead refer to Rab12 as a regulator of LRRK2 activity.

      As noted by the reviewer, our data does suggest that Rab12 is required for the increase in Rab10 phosphorylation observed following LLOMe treatment to elicit lysosomal damage, and we now refer to this appropriately throughout the text.

      2) The authors conclude that Rab12 recruitment precedes that of LRRK2, but the rate of recruitment (slopes of the curves in 3F and G) is actually faster for LRRK2 than for Rab12, and there is no evidence that Rab12 arrives first. Please modify the text; this looks more like coordinated recruitment.

      The reviewer raises an excellent point regarding our ability to delineate whether Rab12 recruitment precedes that of LRRK2 on lysosomes following LLOMe treatment. As noted by the reviewer, we do see both the recruitment of Rab12 and LRRK2 to lysosomes increase on a similar timescale, so we cannot truly resolve whether Rab12 recruitment precedes LRRK2 recruitment in our studies. Based on this, we have modified the text to emphasize that this data supports coordinated recruitment, as suggested, and we have further removed any mention of Rab12 preceding LRRK2. The specific change is as follows “Rab12 colocalization with LRRK2 increased over time following LLOMe treatment, supporting potential coordinated recruitment of these proteins to lysosomes upon damage (Figure 3I). Together, these data demonstrate that Rab12 and LRRK2 both associate with lysosomes following membrane rupture.” and can be found on lines 460-463 of the updated manuscript.
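      The distinction between how fast a signal rises and when it starts rising can be made explicit. The Python sketch below computes a least-squares slope and a threshold-crossing onset for two recruitment time courses; the function names, the 0.05 threshold, and all values are hypothetical and are not data from the manuscript. In this made-up example the steeper curve starts later, so a faster rate alone does not establish which protein arrives first.

```python
from typing import Sequence


def linear_slope(t: Sequence[float], y: Sequence[float]) -> float:
    """Ordinary least-squares slope of y versus t."""
    n = len(t)
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    num = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    den = sum((ti - t_mean) ** 2 for ti in t)
    return num / den


def onset_time(t: Sequence[float], y: Sequence[float], threshold: float) -> float:
    """First sampled time at which y exceeds threshold (inf if it never does)."""
    for ti, yi in zip(t, y):
        if yi > threshold:
            return ti
    return float("inf")


# Hypothetical normalized lysosomal signals sampled every 5 min after damage.
times = [0, 5, 10, 15, 20, 25, 30]
protein_a = [0.00, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60]  # early onset, shallow rise
protein_b = [0.00, 0.00, 0.00, 0.25, 0.50, 0.75, 1.00]  # late onset, steep rise

for name, y in (("A", protein_a), ("B", protein_b)):
    print(name, round(linear_slope(times, y), 4), onset_time(times, y, 0.05))
```

      Resolving true precedence would additionally require sampling intervals shorter than the onset difference and statistics across cells, which is why "coordinated recruitment" is the more cautious reading.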

      3) The title is misleading because the authors do not show that Rab12 promotes LRRK2 membrane association. This would require showing that Rab12 is sufficient to recruit LRRK2 to mislocalized Rab12. The authors DO show that Rab12 is needed for the massive LLOMe activation at lysosomes. Please re-word the title.

      To address the reviewer’s concern regarding the title of our manuscript, we have modified the title from “Rab12 regulates LRRK2 activity by promoting its localization to lysosomes” to “Rab12 regulates LRRK2 activity by facilitating its localization to lysosomes” to soften the language around the sufficiency of Rab12 in regulating the localization of LRRK2 to lysosomes. We show that Rab12 deletion significantly reduces LRRK2 activity (as assessed by Rab10 phosphorylation on lysosomes) and significantly reduces the localization of LRRK2 to lysosomes upon lysosomal damage. The updated title better reflects the regulatory role of Rab12 in modulating LRRK2 activity, and we thank the reviewer for their suggestion to modify this accordingly.

      Reviewer #2 (Public Review):

      This study shows that Rab12 has a role in the phosphorylation of Rab10 by LRRK2. Many publications have previously focused on the phosphorylation targets of LRRK2, and the significance of many remains unclear, but the study of LRRK2 activation has mostly focused on the role of disease-associated mutations (in LRRK2 and VPS35) and Rab29. The work is performed entirely in an alveolar lung cell line, limiting relevance for the nervous system. Nonetheless, the authors take advantage of this simplified system to explore the mechanism by which Rab12 activates LRRK2. In general, the work is performed very carefully with appropriate controls, excluding trivial explanations for the results, but there are several serious problems with the experiments and in particular with their interpretation.

      We appreciate the reviewer’s comments regarding the rigor of our work and the potential impact of our studies to address a key unanswered question in the field regarding the mechanisms by which LRRK2 activation is mediated. Our studies focused on the A549 cell model given its high endogenous expression of LRRK2 and Rab10, and this cell line provided a simple system to investigate the mechanism and impact of Rab12-dependent regulation of LRRK2 activity. We agree with the reviewer that future studies are warranted to understand whether similar Rab12-dependent regulation of LRRK2 occurs in relevant CNS cell types.

      First, the authors note that Rab29 appears to have a smaller effect, or none, when knocked down in these cells. However, the quantitation (Figure 1-figure supplement 1A) shows a much less efficient knockdown of Rab29 than of Rab12, so it would be important to repeat this with better knockdown or, preferably, a KO (by CRISPR) before drawing this conclusion. The relationship to Rab29 is important, so if a better KD or KO shows an effect, it would be important to assess the effect of knocking down Rab12 in the Rab29 KO background.

      The reviewer raises a good point regarding the importance of confirming that loss of Rab29 has no effect on Rab10 phosphorylation. To address potential concerns about insufficient Rab29 knockdown, we measured the levels of pT73 Rab10 in RAB29 KO A549 cells by MSD-based analysis. RAB29 deletion had no effect on Rab10 phosphorylation, confirming findings from our RAB siRNA screen and the observations of Dario Alessi’s group reported previously (Kalogeropulou et al Biochem J 2020; PMID: 33135724). We have included this new data into our updated manuscript in Figure 1- figure supplement 1 and comment on it on page 6 in the updated Results section.

      Secondly, knockdown of Rab12 generally has a strong effect on the phosphorylation of the LRRK2 substrate Rab10, but I could not find an experiment that shows whether Rab12 has any effect on the residual phosphorylation of Rab10 in the LRRK2 KO. There is not much phosphorylation left in the absence of LRRK2, but this residual signal may depend on Rab12 just as much as it does in cells with LRRK2, in which case Rab12 would be operating independently of LRRK2, either through a different kinase or simply by making Rab10 more available for phosphorylation. The epistasis experiment is crucial to address this possibility. To establish the connection to LRRK2, it would also help to compare the effect of Rab12 KD on the phosphorylation of selected Rabs that do or do not depend on LRRK2.

      The reviewer raises an interesting question regarding whether Rab12 can further reduce Rab10 phosphorylation independently of LRRK2. Using our quantitative MSD-based assay, we observe that pRab10 levels are at the lower limits of detection of the assay in LRRK2 KO A549 cells. Unfortunately, this means that we are unable to detect whether there might be any additional minor reduction in Rab10 phosphorylation with Rab12 knockdown in LRRK2 KO cells. We cannot rule out that Rab12 may play a LRRK2-independent role in regulating Rab10 phosphorylation in other cell lines, and future studies are warranted to explore whether Rab12 knockdown can further reduce Rab10 phosphorylation in other systems, including in CNS cells.
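      A further reduction cannot be resolved when both readings already fall below the assay's detection floor. A minimal Python sketch, assuming the common "mean of blanks plus three standard deviations" convention; the response does not state how the MSD assay's limit of detection is defined, and the function names and all signal values below are hypothetical.

```python
import statistics
from typing import Sequence


def limit_of_detection(blanks: Sequence[float], k: float = 3.0) -> float:
    """Mean of blank (no-analyte) replicates plus k sample standard deviations."""
    return statistics.mean(blanks) + k * statistics.stdev(blanks)


def is_detectable(signal: float, blanks: Sequence[float], k: float = 3.0) -> bool:
    """True if a reading lies above the limit of detection."""
    return signal > limit_of_detection(blanks, k)


# Hypothetical raw assay signals (arbitrary units).
blank_wells = [100.0, 110.0, 90.0, 105.0, 95.0]
ko_reading = 120.0          # a knockout sample already close to the floor
ko_plus_kd_reading = 110.0  # a further reduction on top of it

print(limit_of_detection(blank_wells))  # about 123.7
print(is_detectable(ko_reading, blank_wells), is_detectable(ko_plus_kd_reading, blank_wells))
```

      Because both hypothetical readings sit below the floor, the difference between them carries no usable information, which mirrors the situation described for LRRK2 KO cells.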

      To explore whether Rab12 also affects the phosphorylation of other Rabs, we assessed the impact of RAB12 KO on phosphorylation of another LRRK2 Rab substrate, Rab8a. We observed a strong reduction in pT72 Rab8a levels in RAB12 KO cells compared to wildtype cells, suggesting that the impact of RAB12 deletion extends beyond Rab10 (see representative western blot in Author response image 1). Due to potential concerns with the selectivity of the pT72 Rab8a antibody (which may also detect phosphorylation of other LRRK2 Rab substrates), we cannot definitively demonstrate that Rab12 mediates the phosphorylation of other Rabs. This question should be revisited when additional phospho-Rab antibodies become available that enable us to selectively detect LRRK2-dependent phosphorylation of additional Rab substrates under endogenous expression conditions.

      Author response image 1.

      A strength of the work is the demonstration of p-Rab10 recruitment to lysosomes by biochemistry and imaging. The demonstration that LRRK2 is required for this by biochemistry (Fig 4A) is very important, but it would also be good to determine whether the requirement for LRRK2 extends to imaging. In support of a causal relationship, the authors also state that lysosomal accumulation of Rab12 precedes that of LRRK2, but the data do not show this. Imaging with and without LRRK2 would provide more compelling evidence for a causative role.

      We thank the reviewer for their suggestion to assess Rab12 recruitment to damaged lysosomes with and without LRRK2 using imaging-based analyses to add confidence to our findings from biochemical approaches. To address this comment, we have imaged the recruitment of mCherry-tagged Rab12 to lysosomes (as assessed using an antibody against endogenous LAMP1) and observed a significant increase in Rab12 levels on lysosomes following LLOMe treatment. This occurs to a similar extent in LRRK2 KO A549 cells, suggesting that Rab12 is an upstream regulator of LRRK2 activity. This new data has been incorporated into the revised manuscript (Figure 3E) and is presented on page 20 of the updated manuscript.

      Our conclusions on this are further strengthened by new data assessing Rab12 recruitment to lysosomes using orthogonal analysis of isolated lysosomes biochemically. Using the Lyso-IP method, we observed a strong increase in the levels of Rab12 on lysosomes following LLOMe treatment that was maintained in LRRK2 KO cells. These data have been added to the updated manuscript (new data added to Figure 3- figure supplement 1).

      Together, these data support our hypothesis that Rab12 recruitment to damaged lysosomes is upstream, and independent, of LRRK2.

      The authors also touch on PD mutations, showing that loss of Rab12 reduces the phosphorylation of Rab10. However, it is interesting that loss of Rab12 has the same effect with R1441G LRRK2 and D620N VPS35 as it does in controls. This suggests that the effect of Rab12 does not depend on the extent of LRRK2 activation. It is also surprising that R1441G LRRK2 does not increase p-Rab10 levels (Fig 2G), as suggested in the literature and stated in the text.

      We agree with the reviewer that it is quite interesting that RAB12 knockdown significantly attenuates Rab10 phosphorylation in the context of PD-linked variants, in addition to the effect observed in wildtype cells basally and after LLOMe treatment. As noted by the reviewer, we did not observe increased levels of phospho-Rab10 in LRRK2 R1441G KI A549 cells at the whole-cell level (Figure 2G). However, we observed a significant increase in Rab10 phosphorylation on isolated lysosomes from LRRK2 R1441G KI cells compared to WT cells (Figure 4B). This may suggest that the LRRK2 R1441G variant leads to a more modest increase in LRRK2 activity in this cell model. Previous studies in MEFs from LRRK2 R1441G KI mice or neutrophils from human subjects that carry the LRRK2 R1441G variant showed a 3- to 4-fold increase in Rab10 phosphorylation (Fan et al Acta Neuropathol 2021 PMID: 34125248 and Karaye et al Mol Cell Proteomics 2020 PMID: 32601174), supporting the view that this variant does lead to increased Rab10 phosphorylation and that the extent of LRRK2 activation may vary across cell types.

      Most importantly, the final figure suggests that PD-associated mutations in LRRK2 and VPS35 occlude the effect of lysosomal disruption on lysosomal recruitment of LRRK2 (Fig 4D) but do not impair the phosphorylation of Rab10 also triggered by lysosomal disruption (4A-C). Phosphorylation of this target thus appears to be regulated independently of LRRK2 recruitment to the lysosome, suggesting another level of control (perhaps of kinase activity rather than localization) that has not been considered.

      The reviewer suggests an interesting hypothesis regarding additional levels of control, beyond the lysosomal levels of LRRK2, that could lead to increased Rab10 phosphorylation on lysosomes. Given the variability we have observed in measuring endogenous LRRK2 levels on lysosomes, we performed two additional replicates to assess lysosomal LRRK2 levels in LRRK2 R1441G KI and VPS35 D620N KI cells at baseline and after treatment with LLOMe. We observed a significant increase in LRRK2 levels on lysosomes in cells expressing either PD-linked variant and a trend toward a further increase in the levels of LRRK2 on lysosomes after LLOMe treatment in these cells (Figure 4D in the updated manuscript). We have updated the text on page 24 to reflect this change, noting that the PD-linked variants do not fully occlude the effect of lysosomal disruption on the lysosomal recruitment of LRRK2.

      LLOMe treatment leads to a stronger increase in Rab10 phosphorylation on lysosomes from LRRK2 R1441G and VPS35 D620N cells compared to the modest increase in LRRK2 levels observed. This could suggest that, as the reviewer noted, additional mechanisms beyond increased lysosomal localization of LRRK2 may be driving the robust increase in Rab10 phosphorylation observed. We have modified the results section on lines 548-551 to highlight this possibility: “Rab10 phosphorylation showed a more significant increase in response to LLOMe treatment than LRRK2 on lysosomes from LRRK2 R1441G and VPS35 D620N KI cells, suggesting that there may be more regulation beyond the enhanced proximity between LRRK2 and Rab that contribute to LRRK2 activation in response to lysosomal damage.”

      Reviewer #3 (Public Review):

      Increased LRRK2 kinase activity is known to confer Parkinson's disease risk. While much is known about disease-causing LRRK2 mutations that increase LRRK2 kinase activity, the normal cellular mechanisms of LRRK2 activation are less well understood. Rab GTPases are known to play a role in LRRK2 activation and to be substrates for the kinase activity of LRRK2. However, much of the data on Rabs in LRRK2 activation comes from over-expression studies, and the contributions of endogenously expressed Rabs to LRRK2 activation are less clear. To address this problem, Bondar and colleagues tested the impact of systematically depleting candidate Rab GTPases on LRRK2 activity, as measured by its ability to phosphorylate Rab10, in the human A549 type 2 pneumocyte cell line. This resulted in the identification of a major role for Rab12 in controlling LRRK2 activity towards Rab10 in this model system. Follow-up studies show that this role for Rab12 is of particular importance for the phosphorylation of Rab10 by LRRK2 at damaged lysosomes. Increases in LRRK2 activity in cells harboring disease-causing mutants of LRRK2 and VPS35 also depend (at least partially) on Rab12. Confidence in the role of Rab12 in supporting LRRK2 activity is strengthened by parallel experiments showing that siRNA-mediated depletion of Rab12 and CRISPR-mediated Rab12 KO have similar effects on LRRK2 activity. Collectively, these results demonstrate a novel role for Rab12 in supporting LRRK2 activation in A549 cells. It is likely that this effect is generalizable to other cell types; however, this remains to be established. It is also likely that lysosomes are the subcellular site where Rab12-dependent activation of LRRK2 occurs. Independent validation with additional experiments would strengthen this conclusion and help to address the concern that much of the data supporting a lysosomal localization for Rab12-dependent activation of LRRK2 comes from a single method (Lyso-IP). Furthermore, there is a discrepancy between panels 4A and 4D in the effect of LLOMe-induced lysosome damage on LRRK2 recruitment to lysosomes that will need to be addressed to strengthen confidence in conclusions about lysosomes as sites of LRRK2 activation by Rab12.

      We thank the reviewer for their comments regarding our work that identifies Rab12 as a novel regulator of LRRK2 activation and the appreciation of the parallel approaches we employed to add confidence in this effect.

      As suggested by the reviewer, we have updated our manuscript to now include independent validation of our conclusions using imaging-based analyses to complement our data from biochemical analyses using the Lyso-IP method. Specifically, we have included new imaging data that confirms that Rab12 levels are increased on lysosomes following membrane permeabilization with LLOMe treatment and demonstrates that this occurs independent of LRRK2, providing additional support that Rab12 is an upstream regulator of LRRK2 activity (Figure 3E in the updated manuscript).

      Regarding the reviewer’s comment on a discrepancy between our findings in Figure 4A and Figure 4D, we have performed additional independent replicates in Figure 4D to assess the impact of lysosomal damage on the lysosomal levels of LRRK2 at baseline or upon the expression of genetic variants. We observed a significant increase in LRRK2 levels on lysosomes following LLOMe treatment in the set of experiments included in Figure 4A and a non-significant trend toward an increase in LRRK2 levels on isolated lysosomes in Figure 4D. As described in more detail below (in response to the second point raised by this reviewer), we think this variability arises from a combination of low levels of LRRK2 on lysosomes with endogenous expression and variability across experiments in the efficiency of lysosomal isolation. Our observations of increased recruitment of LRRK2 to lysosomes upon damage are further supported by parallel imaging-based studies (Figure 3F-I) and are consistent with previous studies using overexpression systems.

      We thank the reviewer for all of the suggestions which have added further confidence to our conclusions and substantially improved the manuscript.

    1. Author Response:

      Reviewer #1:

      This work aims to characterize the molecular and kinetic mechanism by which three members of the SLC6 family of transporters, namely the transporters for dopamine (DAT), norepinephrine (NET), and serotonin (SERT), move substrate across the membrane, and how the transport process is affected by cations. The authors use electrophysiology and sophisticated rapid solution exchange methods, in conjunction with fluorescence recordings from single cells, to correlate flux (from fluorescence) with electrical activity (from currents).

      The strength of the methods lies in the application of a kinetic method with high time resolution, which allows fast processes in the transport mechanism to be isolated and modeled with a kinetic multistep scheme. Particularly useful is the combination with fluorescence recording from single cells, which allows the authors to measure flux and current in the same cell under voltage clamp conditions. This is an elegant approach to get information on the voltage dependence of substrate flux, which is difficult to obtain with other methods. As to the strength of the results, the data are generally of high quality, showing the kinetic and mechanistic similarities and differences between the three transporters under observation. Another strength is that the results are quantitatively represented by kinetic simulations, which appear to fit the experimental data well.

      The major weakness of the research relates to the interpretation of the experimental results. While the authors propose a unified K+ interaction mechanism for the three transporters, DAT, NET and SERT, the proposed K+ association/dissociation mechanism is 1) highly unusual, and 2) not unique in its ability to explain the experimental data. As to point 1), the DAT mechanism (Fig. 7A) proposes a sequence of intracellular K+ association and dissociation steps. Since the intracellular [K+] remains constant, such a sequence requires a change of affinity for K+, which is initially high when K+ associates (33 μM according to the provided rate constants) and then has to be low for K+ dissociation (3.3 mM). Such an affinity change requires an input of free energy to promote K+ dissociation. From the provided rate constants and at room temperature, this free energy change can be approximated as 11.4 kJ/mol. This is a large amount of energy, in fact larger than what is stored in the physiological concentration gradient for one Na+ ion as a driving force for transport. It appears that the transporter would waste a lot of energy for no apparent benefit, with a futile K+ association/dissociation cycle that would just generate heat.
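      The 11.4 kJ/mol estimate follows from the two dissociation constants: converting a site from a K_d of 33 μM to one of 3.3 mM costs RT ln(3.3 mM / 33 μM) = RT ln 100. A short Python check, assuming T = 298.15 K for room temperature (the function name is illustrative):

```python
import math

R = 8.314462618  # gas constant, J mol^-1 K^-1
T = 298.15       # assumed room temperature, K


def affinity_switch_energy(kd_high_affinity: float, kd_low_affinity: float,
                           temperature: float = T) -> float:
    """Free energy (J/mol) needed to lower a site's affinity from
    kd_high_affinity to kd_low_affinity: R*T*ln(Kd_low / Kd_high)."""
    return R * temperature * math.log(kd_low_affinity / kd_high_affinity)


dg = affinity_switch_energy(33e-6, 3.3e-3)  # 33 uM -> 3.3 mM
print(f"{dg / 1000:.1f} kJ/mol")  # 11.4 kJ/mol
```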

      Therefore, while the authors have achieved their aim of quantitatively assessing transporter function and describing it thoroughly by a kinetic mechanism, their final proposed mechanism does not support all of the conclusions because it is far from unique in being able to explain the data (point 2 above). While this may be true for other transport mechanisms proposed in the past, the mechanism proposed here is somewhat odd with respect to energy requirements. Thus, it would require extraordinary experimental proof to propose it to the exclusion of other, perhaps more plausible, mechanisms.

      Despite these shortcomings, the potential impact of the work is high, because a unifying theory of cation interaction and stoichiometry of the monoamine transporter members of the SLC6 family has been missing in the literature. In addition, the elegant method of combining single-cell electrophysiology and fluorescence flux measurements is impactful, especially in the whole-cell recording configuration, which allows control of the intracellular ionic composition.

      We thank reviewer 1 for his comments on the kinetic modelling. We do not claim that the mechanism, which we propose, is unique in its ability to explain the data. However, we should like to argue that the proposed mechanism is plausible and parsimonious. We, much like reviewer 1, initially asked the question, whether a mechanism requiring an ion such as potassium to associate and subsequently dissociate from the same side of the transporter was energetically feasible. In fact, one of the main reasons for employing kinetic models was to address this specific issue.

      If detailed balance in a kinetic model is maintained (i.e., the product of the rates in the forward direction of a loop equals the product of the rates in the reverse direction), the model is energetically sound (i.e., such a model does not violate the laws of thermodynamics). It is true that for a spontaneous reaction to occur, the Gibbs free energy has to be negative. In a multistep process, however, this consideration only pertains to the “initial” and the “final” state. As long as the Gibbs free energy between these two states is negative the reaction will proceed, even if the Gibbs free energy between “intermediate” states is positive. This point is illustrated in the schemes below.
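      The detailed-balance criterion can be checked mechanically for any closed loop of a kinetic scheme. A minimal Python sketch, assuming the rates are intrinsic rate constants (concentration factors removed, voltage-dependent steps evaluated at 0 mV); the function names and the four-state loop are hypothetical and are not the DAT model's parameters.

```python
import math
from typing import Sequence


def satisfies_detailed_balance(forward: Sequence[float], reverse: Sequence[float],
                               rel_tol: float = 1e-9) -> bool:
    """Cycle condition for one closed loop.

    forward[i] is the rate constant of step i traversed in one direction around
    the loop; reverse[i] is the rate constant of the same step traversed backwards.
    """
    if len(forward) != len(reverse):
        raise ValueError("each step needs a forward and a reverse rate constant")
    return math.isclose(math.prod(forward), math.prod(reverse), rel_tol=rel_tol)


def constrained_last_reverse(forward: Sequence[float], reverse_others: Sequence[float]) -> float:
    """Value the final reverse rate must take for the loop to obey detailed balance."""
    return math.prod(forward) / math.prod(reverse_others)


# Hypothetical four-state loop (units omitted).
kf = [2.0, 3.0, 5.0, 7.0]
kr = [1.0, 6.0, 7.0]
kr.append(constrained_last_reverse(kf, kr))  # 210 / 42 = 5.0
print(satisfies_detailed_balance(kf, kr))    # True
```

      Fixing one rate constant per loop in this way is a common approach to keep a fitted scheme thermodynamically consistent; whether the authors applied exactly this constraint is not stated here.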

      Scheme (A) maps out the Gibbs free energy of the outer loop of the kinetic model of DAT (i.e., the conformational trajectory that the transporter takes in the presence of intracellular K+; see the scheme in Fig. 7A of the manuscript). To calculate the Gibbs free energy of this loop, we assumed a pre-equilibrium condition (i.e., extracellular and intracellular substrate concentrations that we arbitrarily set to 10 μM and 100 nM, respectively) and a membrane voltage of 0 mV. As shown in the scheme, the Gibbs free energy between the “initial-left” and the “final-right” state is negative. Accordingly, the multistep reaction can proceed spontaneously.

      In scheme (B), we mapped out the Gibbs free energy for the same path and the same pre-equilibrium condition as shown in scheme (A); the only difference is that the membrane potential was now assumed to be -60 mV. This is to show that voltage is also a determining factor of the extent by which the Gibbs free energy changes.
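      The voltage term enters through the electrical work done on any charge that crosses the membrane. As a rough illustration restricted to the substrate's own electrochemical gradient (the schemes map the full loop, including Na+ binding and conformational steps, which this sketch does not), the following Python code evaluates RT ln(c_in/c_out) + zFV_m for the 10 μM / 100 nM condition, assuming a monovalent cationic substrate (z = +1) and T = 298.15 K. The function name is illustrative.

```python
import math

R = 8.314462618   # gas constant, J mol^-1 K^-1
F = 96485.33212   # Faraday constant, C mol^-1
T = 298.15        # assumed temperature, K


def influx_free_energy(c_in: float, c_out: float, z: int, v_m: float,
                       temperature: float = T) -> float:
    """Molar free energy change (J/mol) for moving one species of valence z
    from outside to inside; v_m is the membrane potential (inside minus
    outside) in volts. Negative values mean influx is downhill."""
    return R * temperature * math.log(c_in / c_out) + z * F * v_m


for v_m in (0.0, -0.060):
    dg = influx_free_energy(100e-9, 10e-6, z=1, v_m=v_m)
    print(f"V_m = {v_m * 1000:.0f} mV: {dg / 1000:.1f} kJ/mol")
```

      A more negative potential makes influx of a cation more favorable, which is the sense in which voltage determines the extent of the free-energy change.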

      In Scheme (C), we mapped out the Gibbs free energy at equilibrium (the difference in Gibbs free energy between the “initial” and the “final” state is zero). This condition is met when the intracellular substrate concentration is 155 μM. At this intracellular substrate concentration, the energy stored in the substrate gradient exactly matches the energy of the Na+ gradient. The model therefore predicts that no energy is dissipated as heat, which runs counter to the concern raised by reviewer 1. We admit that the model can be criticized on this ground, because, arguably, a realistic process is expected to dissipate energy as heat even if it involves a microscopic system (as is the case here). Determining how much heat is generated in a transport cycle is, however, beyond the scope of the present manuscript and warrants a detailed study. In such a study, one could investigate whether any heat loss can be compensated by, for instance, the occasional antiport of K+ by DAT, which, as we point out in the discussion, is possible. In this context, we stress that the energetic costs would have been much higher if we had assumed non-obligatory antiport of K+ through DAT. Such a mechanism predicts that the K+ gradient is constantly dissipated in the absence of the substrate, which would indeed create the futile heat loss reviewer 1 is concerned about.
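      Equilibrium corresponds to the point at which the substrate gradient stores exactly the free energy supplied by the coupled ion flux. The Python sketch below solves RT ln(c_in/c_out) + zFV_m + ΔG_coupled = 0 for c_in. The -5 kJ/mol coupling term and the function name are placeholders chosen for illustration, not values implied by the authors' model, so the sketch does not reproduce the 155 μM figure.

```python
import math

R = 8.314462618   # gas constant, J mol^-1 K^-1
F = 96485.33212   # Faraday constant, C mol^-1
T = 298.15        # assumed temperature, K


def equilibrium_inside_concentration(c_out: float, coupled_dg: float, z: int = 0,
                                     v_m: float = 0.0, temperature: float = T) -> float:
    """Inside substrate concentration at which net transport stops.

    coupled_dg (J/mol, negative when favourable) is the free energy released by
    the co-transported ions per cycle; z and v_m add the substrate's own charge.
    Solves R*T*ln(c_in/c_out) + z*F*v_m + coupled_dg = 0 for c_in.
    """
    return c_out * math.exp(-(coupled_dg + z * F * v_m) / (R * temperature))


c_out = 10e-6      # 10 uM outside, as in the response
coupled = -5000.0  # hypothetical coupling energy, J/mol
c_in = equilibrium_inside_concentration(c_out, coupled)
print(f"c_in at equilibrium: {c_in * 1e6:.1f} uM")
# The total free energy change at this concentration is (numerically) zero.
print(R * T * math.log(c_in / c_out) + coupled)
```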

      An alternative hypothesis for the actions of intracellular K+ on the DAT transport cycle would be to propose the presence of a regulatory K+ binding site. We are reluctant to assume this mechanism for the simple reason that there is little evidence for such sites from the available crystal structures. The view that K+ binds to the Na2 site in DAT, NET and SERT is consistent with our data (see Fig. 5). These observations are supported by a previous study showing that K+ can bind to the Na2 site in DAT, as determined by extensive molecular dynamics simulations (Razavi et al., 2017, cited in the manuscript). By its very nature, the Na2 site cannot serve as a regulatory K+ binding site; for the transporter to proceed in the transport cycle, K+ must at some point dissociate from the Na2 site.

      On further scrutiny of our model for DAT, NET and SERT, we noticed that the extracellular and intracellular affinities for Na+ were set too high. We regret this oversight, which arose because we had only simulated experiments in which the intracellular Na+ concentration was zero. The selected Na+ affinities would not have allowed the transporter to function properly at a physiological intracellular Na+ concentration (~10 mM). We have now rectified this problem by lowering the inner and outer Na+ affinities by a factor of 10. In Fig. 7 of the main manuscript and supplementary figure 6, we have replaced all previous simulations of the three transporters with the predictions of the amended model. As seen, the model with the changed Na+ binding parameters still accounts for the key findings of this study.
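      Why an overly high Na+ affinity is a problem at ~10 mM intracellular Na+ can be seen from simple single-site occupancy, c/(c + K_d). A Python sketch with hypothetical K_d values (1 mM before and 10 mM after the tenfold reduction; the model's actual parameters are not given in this response), treating K_d as k_off/k_on so that the tenfold change is applied to the hypothetical off-rate:

```python
def dissociation_constant(k_on: float, k_off: float) -> float:
    """K_d = k_off / k_on (in M if k_on is in M^-1 s^-1 and k_off in s^-1)."""
    return k_off / k_on


def fractional_occupancy(conc: float, kd: float) -> float:
    """Equilibrium occupancy of a single, non-cooperative binding site."""
    return conc / (conc + kd)


na_in = 10e-3                                   # ~10 mM intracellular Na+
k_on = 1e6                                      # hypothetical, M^-1 s^-1
kd_old = dissociation_constant(k_on, 1e3)       # 1 mM (hypothetical)
kd_new = dissociation_constant(k_on, 10 * 1e3)  # tenfold lower affinity: 10 mM

print(round(fractional_occupancy(na_in, kd_old), 3))  # 0.909
print(round(fractional_occupancy(na_in, kd_new), 3))  # 0.5
```

      In this toy example the higher-affinity inward-facing site would stay occupied about 90% of the time, which would hinder Na+ release and thus cycling, whereas after the tenfold reduction it is empty about half of the time.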

      Reviewer #2:

      Bhat et al. study the transport mechanism of three members of the SLC6 family, i.e. DAT, NET and SERT, using a combination of cellular electrophysiology, fluorescence measurements (taking advantage of a fluorescent substrate, APP+, that can be transported by each of these transporters), and kinetic modelling. They find that DAT, NET and SERT differ in intracellular K+ binding. In DAT and NET, intracellular K+ binding is transient, resulting in voltage-dependent transport. In contrast, SERT transports K+, and the addition of a charged substrate to the transport cycle makes serotonin transport voltage-independent.

      This is an extremely nice and interesting manuscript, based on a series of beautifully designed and executed experiments that are convincingly analyzed via a kinetic model. I have only some suggestions:

      1) Fig. 4: I find the description of Fig. 4 extremely difficult to understand. In clear contrast to the introductory sentence "Previous studies showed that Kin+ was antiported by SERT, but not by NET or DAT (Rudnick & Nelson, 1978; Gu et al., 1996; Erreger et al., 2008)", SERT appears to be able to transport APP+ without K+ in Fig. 4. I tried to understand this obvious discrepancy for a long time, until I found the authors returning to this point in the discussion: "However steady-state assessment of transporter mediated substrate uptake is hindered by the fact that all three monoamine transporters can also transport substrate in the absence of Kin+". This comes a little late, and the authors should address this point more explicitly in the Results section, close to the description of Fig. 4.

      We agree with reviewer 2’s comments pertaining to the SERT data represented in Fig.4C. The observations made from this dataset seem confusing in the absence of any relevant context. We have added the following statements to clarify any discrepancy arising from Fig. 4 (lines 266-273): “Owing to the instrumental role of Kin+ in the catalytic cycle of SERT, the observed lack of difference in APP+ uptake profiles by SERT-expressing cells in the presence or absence of Kin+ seem contradictory. This discrepancy can be explained as follows: 1) SERT can alternatively antiport protons to complete its catalytic cycle (Keyes and Rudnick, 1982; Hasenhuetl et al. 2016) and 2) APP+ is a poor SERT substrate (as determined by lack of APP+ induced steady state currents, Fig. 2F and 3F) that may be shuttled into SERT-expressing cells at rates slower than the rate limiting isomerization of SERT from inward open to outward open state.”

      2) Throughout the whole manuscript I am missing statistical details in comparisons.

      Statistical details for comparisons, which were done on some data sets in Fig. 4, Fig.5 and Fig.6, have now been incorporated in the manuscript text.

      3) Since APP+ might also only bind to the transporter, or even only to the cell membrane, the authors might want to look at how the time course of the cellular APP+ signal depends on the size of the cells or on the ratio of transport currents to capacitance. It is of course possible that the tested cells do not differ sufficiently in size to permit such a comparison. The authors should at least comment on this possibility.

      We are working with monoclonal lines. Thus, the differences in cell size are small (between 25 and 30 pF). In the new supplementary figure 1, we show that our previously held conjecture that the fast component represents membrane binding was wrong. In fact, analysis of the APP+ fluorescence in control cells (supplementary figure 1D) suggests that APP+ adherence to the plasma membrane does not contribute a significant fluorescence signal. We apologize for this misinterpretation; please refer to the responses to reviewer 1 for more details.

      4) Another set of results one might look at is the time course of fluorescence decay after the end of the APP+ perfusion (Fig. 2 and 4). Outward transport of substrate (APP+) should have a voltage dependence comparable to that of substrate uptake; moreover, it should depend on the amount of substrate that had previously entered the cell. Could the authors provide such results and use them to exclude specific/unspecific APP+ binding?
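      The analysis proposed here would start by extracting a time constant from the fluorescence decay after washout, which could then be compared across holding potentials. A minimal Python sketch, assuming a baseline-subtracted, strictly positive signal that follows a single exponential; the function name and the synthetic trace are hypothetical. Log-linear regression is used for brevity, but it gives disproportionate weight to the noisy low-signal tail, so nonlinear least squares would be preferable for real recordings.

```python
import math
from typing import Sequence, Tuple


def fit_single_exponential(t: Sequence[float], f: Sequence[float]) -> Tuple[float, float]:
    """Fit f(t) = f0 * exp(-t / tau) by linear regression of ln(f) on t.

    Returns (f0, tau). Requires a baseline-subtracted, strictly positive f.
    """
    logs = [math.log(v) for v in f]
    n = len(t)
    t_mean = sum(t) / n
    l_mean = sum(logs) / n
    slope = (sum((ti - t_mean) * (li - l_mean) for ti, li in zip(t, logs))
             / sum((ti - t_mean) ** 2 for ti in t))
    intercept = l_mean - slope * t_mean
    return math.exp(intercept), -1.0 / slope


# Hypothetical post-washout trace: f0 = 100 a.u., tau = 20 s, sampled every 2 s.
times = [2.0 * i for i in range(31)]
trace = [100.0 * math.exp(-ti / 20.0) for ti in times]
f0, tau = fit_single_exponential(times, trace)
print(f"f0 = {f0:.1f}, tau = {tau:.1f} s")
```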

      In supplementary figure 1 (panels A and C) and video files 1 and 2, we show that APP+ adheres to the intracellular membranes of organelles. This has also been shown previously by others (Solis Jr. et al., 2012; Karpowicz Jr et al., 2013; Wilson et al., 2014, cited in the manuscript). Because these structures serve as sinks, there is no (or only little) free APP+ available for outward transport.

      Reviewer #3:

      The sodium-coupled biogenic amine transporters DAT, NET and SERT terminate the synaptic actions of dopamine, norepinephrine and serotonin, respectively. They belong to the family of neurotransmitter:sodium symporters. These transporters have very similar sequences, and this is reflected at the structural level, as judged by the similarity of the crystal structures of the outward-facing conformations of DAT and SERT. However, earlier functional studies indicated that transport by SERT is electroneutral because the charges of the sodium ions and substrate moving into the cell are compensated by the outward movement of potassium ions (or protons) to complete the transport cycle. On the other hand, DAT and NET are electrogenic. Moreover, potassium ions are not extruded by these transporters, and the Authors set out to investigate whether the electrogenicity is related to differences in potassium handling between SERT and the two other biogenic amine transporters. This was done by analyzing the role of intracellular cations and voltage on substrate transport by the three biogenic amine transporters, using simultaneous recording of the uptake of the fluorescent substrate APP+ and of the current induced by this process under voltage-clamp conditions in single HEK293 cells expressing the transporters. The Authors found that even though uptake by NET and DAT did not require internal potassium, these transporters could actually interact with internal potassium, as judged by the voltage dependence of the so-called peak current. This voltage dependence was very steep in the absence of both sodium and potassium, but became less steep when either cation was present in the internal milieu, indicating that not only sodium but also potassium could bind from the inside. The same result was obtained with SERT. However, uptake by SERT was found to be much less dependent on the membrane voltage than that by DAT and NET and was stimulated by internal potassium, consistent with the proposed electroneutrality of the former. The observations indicate that the structural similarity of the three biogenic amine transporters is also reflected in their ability to bind potassium, even though this cation can translocate to the outside only in SERT.

      Strengths:

      Development of a sophisticated technique to interrogate the mechanism of sodium coupled biogenic amine transport in single cells. Rigorous analysis of the data. Conclusions supported by the data. The methodology can be used to obtain novel insights into the mechanism of other transporters.

      Weaknesses:

      The presentation could be made more "user friendly" by explaining in more detail what is happening as we go through the data. For instance, peak and steady-state currents are shown already in Figure 1, but a (too brief) explanation is only provided when describing Figure 5. A schematic in the first part of the Results would be useful. Some information on the structural background should be provided, as well as a full description of the transport cycle, namely the number of sodium ions translocated per cycle and the argument why chloride remains bound to the transporter throughout the cycle. The control showing that, in contrast to potassium, lithium is inert should be performed not only for DAT, but also for the two other transporters.

      We thank Dr. Kanner for these recommendations. Regarding the role of Na+ and Cl- in the transport cycle of the monoamine transporters, we have briefly addressed this in the introduction as follows: “The crystal structures of both hSERT and dDAT show two bound Na+ ions. However, only one Na+ ion is thought to be released on the intracellular side in both transporters (Rudnick & Sandtner, 2019). Cl-, on the other hand, has been shown to play a modulatory role in the transport cycle of SERT and DAT, but Cl- is not essential for the transport stoichiometry (Erreger et al., 2008; Hasenhuetl et al., 2016).”

      As for the control experiments with Li+, we are very grateful to Dr. Kanner for his suggestions. En route to extending the observations, which we obtained with DAT in the presence of high intracellular Li+, to NET and SERT, we stumbled upon some unexpected results: while IV relations of peak currents with high intracellular Li+ or NMDG+ in NET were identical (similar to DAT), SERT gave us exactly the opposite profiles. IV relations with high intracellular Li+ in SERT were as shallow as those in the presence of high intracellular K+ or high intracellular Na+. This is indicative of intracellular Li+ binding to SERT, an observation not previously reported, which further highlights the differences in cation binding between DAT/NET and SERT. We believe that our observations with Li+ and SERT could be expanded on in a separate story. We have accordingly changed the manuscript text in the Results and Discussion as follows:

      Results (lines 320-337):

      “Because the absence of Kin+ affected the slope of the IV-relation of the peak current, we surmised that potassium bound from the intracellular side not only to SERT but also possibly to DAT and NET. We explored this conjecture by determining the IV relation of peak currents through all three transporters in the presence of lithium (Liin+ = 163 mM) instead of Kin+. Li+ is believed to be an inert cation, because it does not support substrate translocation by SLC6 transporters. As expected, the IV relations of peak currents through DAT and NET were similar in the presence of 163 mM Liin+ to those recorded in the absence of Kin+ (cf. diamond and triangle symbols in Fig. 5J and 5K). These observations clearly indicate that Kin+ binds to both DAT and NET and rule out an alternative explanation, i.e. that the effect can be accounted for by water and monovalent cations briefly occupying a newly available space in the inner vestibule. SERT, on the other hand, shows shallow IV relations of peak currents with high Liin+ when compared to those acquired in the absence of Kin+ (cf. diamond and triangle symbols in Fig. 5L). This is indicative of Liin+ binding to SERT on the intracellular side. The exact nature of Liin+ binding to SERT has not been reported previously and warrants further investigation. The IV relations of peak currents are similar in the presence of 163 mM Kin+ (Fig. 5A-C) and of 163 mM Nain+ (Fig. 5G-I) in DAT, NET and SERT (cf. circle and square symbols in Fig. 5J-L). This is consistent with the idea that Nain+ and Kin+ bind to overlapping sites in these transporters.”

      Discussion (lines 524-527):

      “Interestingly, differences between DAT/NET and SERT are further substantiated by the ability of SERT to bind intracellular Li+. The exact nature of this interaction is unknown and necessitates an in-depth investigation that is beyond the scope of this study.”

    1. Author Response

      Reviewer #1 (Public Review):

      This paper is a follow-up of the authors' previous paper (2018), in which they carefully described the organisation of the junctions between cells of the adult Drosophila midgut epithelium and their control from the basal side by integrin signalling. Here, the authors used state-of-the-art imaging and genetics to unravel step-by-step the events leading from an initially unpolarised cell to an epithelial cell that integrates into the existing epithelium. Many of the images are accompanied by cartoons, which help the reader to better understand the images and follow the conclusions. It would have been helpful, however, particularly with respect to the mutant phenotypes described later, if they had named each of the steps/stages. In addition, mentioning the timescale would give an idea of the temporal frame in which this process elapses.

      We have used terms such as “unpolarised cells, polarised Actin/Cno” to label different stages in Figure 6, since this sequence of steps is inferred from results obtained from fixed samples with still images. We have illustrated the septate junction mutant phenotype in Figure 8I.

      We have also performed a new experiment to estimate the time taken for an activated EB to form a PAC and to become a mature enterocyte, by overexpressing Sox21a with esg[ts]>GFP to induce enteroblast differentiation. Counting the number of GFP+ve cells without a PAC, with a PAC and with a full apical domain at different time points suggests that activated EBs take about a day to form a PAC and another day to form a fully integrated enterocyte. We have summarised the results in Figure 5-figure supplement 1C.

      We have also included this result in the main text as follows: “To estimate the time taken for enteroblasts to progress to pre-enterocytes with a PAC, and for pre-enterocytes to become enterocytes, we induced enterocyte differentiation by over-expressing UAS-Sox21a under the control of esg[ts]-Gal4 and counted the number of GFP+ve cells without a PAC or apical domain, with a PAC and with a full apical domain at different time points after induction (Chen et al., 2016; Meng and Biteau, 2015; Zhai et al., 2017). 17 hours after shifting the flies to 25ºC to inactivate Gal80ts, almost no GFP+ve cells had progressed to pre-ECs with a PAC (0.1%) or ECs (1%), and these few cells probably started to differentiate before Sox21a induction. 24 hours later, 10% of the GFP+ve cells had developed into pre-ECs with a PAC and 20% had become ECs (Figure 5-figure supplement 1B-C). After an additional 24 hours, the proportion of cells with a PAC fell to 1%, whereas 50% were ECs. Assuming that it takes 12-17 hours to induce high levels of Sox21a expression, these results suggest that most activated EBs take about 24 hours to develop into a pre-EC with a PAC and a further 24 hours to differentiate into a mature EC, although some cells differentiate faster. This time frame is in agreement with a previous study using similar approaches to accelerate differentiation (Rojas Villa et al., 2019) and a recent live imaging study tracing the enteroblast to enterocyte transition (Tang et al., 2021). These results also indicate that down-regulation of Sox21a is not essential for enteroblast to pre-enterocyte differentiation, since enteroblasts overexpressing Sox21a still form a PAC (Figure 5-figure supplement 1B).”

      The authors convincingly show that septate junctions are instrumental for proper polarisation and integration of the enteroblast. However, while they nicely showed that Canoe is required neither in the enteroblast nor in the enterocytes for this process, it remains unclear whether septate junction proteins are required in enteroblasts, in enterocytes or in both, and at which particular step the process fails in the mutant.

      Early-stage enteroblasts neither express nor require septate junction proteins, whereas late-stage enteroblasts and pre-enterocytes do (Chen et al., 2020; Hung et al., 2020; Izumi et al., 2019; Xu et al., 2019). Since cells mutant for septate junction proteins do not develop into mature enterocytes with an apical domain facing the gut lumen, we cannot answer the reviewer’s question of whether septate junction proteins are required in enterocytes.

      As we discussed in the paper, we think that “differentiating enteroblasts only require a basal cue to establish their initial apical-basal polarity, whereas the formation of the pre-assembled apical compartment also requires a junctional cue. The septate junctions are not necessary for apical domain formation per se, however, as mesh mutant enteroblasts form a fully developed apical domain with a brush border inside the cell. This suggests that septate junctions define the site of apical domain formation by delimiting the region where apical membrane proteins are secreted to assemble the brush border, but do not control the process of apical domain formation directly.”

      Reviewer #2 (Public Review):

      The authors recently showed that the polarization of the cells of the adult Drosophila midgut does not require any of the canonical epithelial polarity factors, and instead depends on basal cues from adhesion to the ECM, as well as on septate junction proteins (Chen et al, 2018). Here they extend this research to examine in greater detail precisely how midgut epithelial cells integrate into the pre-existing epithelium and become polarized. Surprisingly, they show that enteroblasts form an apical membrane initiation site prior to polarizing. Furthermore, they show that this develops into a pre-apical compartment containing a fully formed brush border. This is a very interesting finding - it explains how enteroblasts can integrate into a pre-existing epithelium without disrupting barrier function. The conclusions of this paper are mostly well supported by data, but some aspects could do with being clarified and extended as outlined below.

      Model presented in Figure 6

      While the separation of membranes indicated in Figure 6 steps 3-5 can be seen in the image shown in Figure 3B, this is one of the few images that support the idea that there is a separation of membranes between the enteroblast and overlying enterocytes during PAC formation. Is the model in Figure 6 supported by EM data - can you see a region where there is brush border and separation of cells? Supplementing Figure 3 with corresponding EM images would greatly aid the reader in interpreting the data and strengthen the model.

      We think that AJ clearing and membrane separation is a brief process that is quickly followed by the separation of the apical and junctional proteins and apical secretion at the AMIS to form the PAC. We have not captured this stage in our EM images, but have many other examples that show this step (e.g. Figure 4C and Figure 8F). Another example is shown below.

      A key step in the model is that the clearance of E-Cadherin from the apical membrane leads to a loss of adhesion between the enteroblast and the overlying enterocytes. This would need to be supported by functional data such as overexpression of E-Cad or E-CadDN in enteroblasts or by generating shg mutant clones. If the model is correct, perturbing E-Cad levels in enteroblasts should lead to defects in PAC formation, such as loss of de-adhesion/early de-adhesion/excessive de-adhesion.

      We think it is the local clearance of ECad from the apical membrane, not the downregulation of the total level of ECad, that is important for the local membrane separation and future PAC formation. The experiment of overexpressing ECad or ECad-DN proposed by the reviewer might be crucial to demonstrate the importance of the total amount of ECad, but might not be very helpful in determining the importance of membrane separation in PAC formation. Moreover, AJ formation in the fly midgut epithelium does not depend on ECad, suggesting that ECad and NCad act redundantly, which further complicates this approach (Choi et al., 2011; Liang et al., 2017).

      Role for the septate junction proteins

      Septate junction proteins were previously shown by these authors to be required for enteroblast polarization and integration into the midgut epithelium (Chen et al, 2018). Here they extend this by examining enteroblasts mutant for septate junction proteins, and conclude that septate junction proteins are required for normal PAC formation. However, it is not clear what aspect of the polarization of the enteroblasts is disrupted, because a number of mesh mutant cells (albeit a lower proportion than in wildtype) do form PACs. The main phenotype seems to be that cells fail to polarize (as previously reported) or have internalised PACs. It is hard to know what to conclude from this data about the role of the septate junction components in PAC formation.

      The major phenotype of the septate junction mutants is the loss of polarity, i.e. an inability to form an apical domain and integrate into the epithelial layer as shown in Figure 8. Neither mesh nor Tsp2a mutants can form a PAC, even though mesh mutant cells have a higher propensity to form an internal PAC-like structure (Figure 8B,C,E,G,H, Figure 8-figure supplement 1L). Thus, we think that septate junctions are required for AMIS and PAC formation. What complicates the interpretation is that some (6-20%) septate junction mutant cells do form an AMIS-like structure (Figure 8D-F, Figure 8-figure supplement 1F&K). The simplest explanation for this result is that this is due to perdurance of the wild-type proteins after clone induction, with the weaker phenotype of ssk mutants being due to longer perdurance of this protein. However, we cannot rule out the alternative explanation that AMIS and PAC formation is facilitated by the septate junction proteins, but that they can still form very inefficiently in their absence.

      We realise that this section was quite confusing in the original version of the manuscript and have now re-written it to make this interpretation clearer.

      Coracle is used as a readout for the localization of septate junction components, yet the staining for Cora in Figure S3B looks quite different to Mesh in S3D. If Cora is to be used as a readout for the localization of septate junction components, then staining for Cora/Mesh and/or Cora/SSk or Tsp2a should be shown.

      When discussing the requirement for septate junctions for enteroblast integration - Coracle and Mesh are used interchangeably - but as mentioned before, it is not clear if they colocalize, or if their localization is interdependent (as demonstrated for Mesh, Tsp2a and Ssk in Figure 7). What is the phenotype of enteroblasts mutant for cora?

      Following from the previous point - while it is clear that Coracle is apical early during AMIS formation, it is not clear if Mesh, Tsp2a and Ssk also are, yet these are the mutants that are examined for a role in AMIS/PAC formation. It would be good to know whether the loss of cora would lead to defects in AMIS formation.

      The reason we used mainly Coracle as a marker for the septate junctions is that Mesh and Tsp2A localise to the basal labyrinth as well as to the septate junctions which could confuse the reader. We have now added new panels to Figure 3-figure supplement 3E&F showing the colocalization of Cora with Mesh/Tsp2a at the septate junctions and during the crucial stages of PAC formation.

      Additional Results:

      "Coracle is a peripheral septate junction protein whose localisation depends on the structural septate junction components such as Mesh/Ssk/Tsp2a (Chen et al., 2018; Izumi et al., 2016, 2012). Cora antibody staining provides a clearer marker for the septate junctions than Mesh or Tsp2a antibody staining, because the latter also label the basal labyrinth (Figure 3-figure supplement 1E&F). To determine whether Cora is required for PAC formation or epithelial polarity in the adult midgut, we generated a null mutant allele with a premature stop codon in the FERM domain using CRISPR. Cells mutant for this allele, corajc, or a second cora null allele, cora5, can form a PAC, septate junctions and a full apical domain, indicating that Cora is also not required for enteroblast integration or enterocyte polarity (Figure 7F&G, Figure 7-figure supplement 1E-H)."

      Additional Materials and Methods:

      We used the CRISPR/Cas9 method (Bassett and Liu, 2014) to generate null alleles of canoe and coracle. sgRNA was in vitro transcribed from a DNA template created by PCR from two partially complementary primers:

      forward primer:

      For coracle:

      5′-GAAATTAATACGACTCACTATAGAAGCTGGCCATGTACGGCGGTTTTAGAGCTAGAAATAGC-3′;

      The sgRNA was injected into…Act5c-Cas9 embryos to generate coracle null alleles (Port et al., 2014). Putative…coracle mutants in the progeny of the injected embryos were recovered, balanced, and sequenced. …The coraclejc allele contains a 2bp deletion around the CRISPR site, resulting in a frameshift that leads to a stop codon at amino acid 225 in the middle of the FERM domain, which is shared by all isoforms. No Coracle protein was detectable by antibody (DSHB C615.16) staining in either midgut or follicle cell clones. The coraclejc allele was recombined with FRT G13 to make the FRTG13 coraclejc flies.

      It is unclear what is happening in Figure 8A,C,E, S7D. Is that a detachment phenotype or an integration phenotype? Are the majority of cells unpolarised due to loss of integrin attachment rather than failure to form an AMIS/PAC?

      Cells mutant for septate junction proteins do not detach from the basement membrane and still localise Talin basally, as illustrated by the new panel we have added (Figure 8-figure supplement 1N), showing Talin localisation in Tsp2a mutant cell.

      However, because the mutant cells cannot integrate and remain stuck beneath the septate junctions between the enterocytes, they sometimes become displaced from a portion of the basement membrane by younger EBs that derive from the same mutant ISC, leading to a pile up of cells in the basal region of the epithelium (e.g. Figure 8A, E and H).

      We have added the following sentences to the Results, explaining these points:

      "Because the mutant cells remain trapped beneath enterocyte-enterocyte septate junctions, they accumulate in the basal region of the epithelium, with new EBs derived from the same mutant ISC forming beneath them and reducing their contact with the basement membrane (Figure 8A)."

      "The majority of cells mutant for septate junction components fail to polarise or form an AMIS, although they form normal lateral and basal domains, as the basal integrin signalling component, Talin, localises normally (Figure 8-figure supplement 1N)."

      It is unclear whether enteroblasts really pass through an 'unpolarized stage'. In Figure 6, when they are described as 'unpolarised', they clearly have distinct basal and AJ domains. In septate junction mutants, when cells are classified as unpolarized, do they still have distinct regions of integrin/E-Cad expression?

      This is a semantic question. We agree that they have distinct lateral and basal domains, but they do not have an apical domain. In this respect, these "unpolarised" cells are similar to a mesenchymal fibroblast migrating on a substrate, which has a distinct basal side contacting the substrate that is different from the non-contacting regions of the cell surface. They also match the description of the migratory, "mesenchymal" enteroblasts (Antonello et al., 2015). To make this clearer, we have added the following notes to the legend for Figure 6: “Unpolarised” in the second panel of this figure indicates that the enteroblast has not formed a distinct apical domain. At this stage, no marker is clearly apically localised. “unpolarised” or “polarised” in the third and fourth panels describe the localisation of marker proteins, such as Actin and Cno."

    1. Author Response

      eLife assessment

      This important paper exploits new cryo-EM tomography tools to examine the state of chromatin in situ. The experimental work is meticulously performed and convincing, with a vast amount of data collected. The main findings are interpreted by the authors to suggest that the majority of yeast nucleosomes lack a stable octameric conformation. Despite the possibly controversial nature of this report, it is our hope that such work will spark thought-provoking debate, and further the development of exciting new tools that can interrogate native chromatin shape and associated function in vivo.

      We thank the Editors and Reviewers for their thoughtful and helpful comments. We also appreciate the extraordinary amount of effort needed to assess both the lengthy manuscript and the previous reviews. Below, we provide our provisional responses in bold blue font. The majority of the comments are straightforward to address. We have taken a more conservative approach with the subset of comments that would require us to speculate because we either lack key information or we lack technical expertise. Instead of adding the speculative replies to the main text, we think it will be better to leave them in the rebuttal for posterity. Readers will therefore have access to our speculation and know that we did not feel confident enough to include these thoughts in the Version of Record.

      Reviewer #1 (Public Review):

      This manuscript by Tan et al uses cryo-electron tomography to investigate the structure of yeast nucleosomes both ex vivo (nuclear lysates) and in situ (lamellae and cryosections). The sheer number of experiments and results is astounding and comparable with an entire PhD thesis. However, as is always the case, it is hard to prove that something is not there. In this case, canonical nucleosomes. In their path to find the nucleosomes, the authors also stumble over new insights into nucleosome arrangement that indicate that the positions of the histones are more flexible than previously believed.

      We want to point out that canonical nucleosomes are there in wild-type cells in situ, albeit rarer than what’s expected based on our HeLa cell analysis. The negative result (absence of any canonical nucleosome classes in situ) was found in the histone-GFP mutants.

      Major strengths and weaknesses:

      Personally, I am not ready to agree with their conclusion that heterogenous non-canonical nucleosomes predominate in yeast cells, but this reviewer is not an expert in the field of nucleosomes and can't judge how well these results fit into previous results in the field. As a technological expert though, I think the authors have done everything possible to test that hypothesis with today's available methods. One can debate whether it is necessary to have 35 supplementary figures, but after working through them all, I see that the nature of the argument needs all that support, precisely because it is so hard to show what is not there. The massive amount of work that has gone into this manuscript and the state-of-the-art nature of the technology should be warmly commended. I also think the authors have done a really great job with including all their results to the benefit of the scientific community. Yet, I am left with some questions and comments:

      Could the nucleosomes change into other shapes that were predetermined in situ? Could the authors expand on if there was a structure or two that was more common than the others of the classes they found? Or would this not have been found because of the template matching and later reference particle used?

      Our best guess (speculation) is that one of the class averages that is smaller than the canonical nucleosome contains one or more non-canonical nucleosome classes. We do not feel confident enough to single out any of these classes precisely because we do not yet know if they arise from one non-canonical nucleosome structure or from multiple – and therefore mis-classified – non-canonical nucleosome structures (potentially with other non-nucleosome complexes mixed in). We feel it is better to leave this discussion out of the manuscript, or risk sending the community on wild goose chases.

      Our template-matching workflow uses a low-enough cross-correlation threshold that any nucleosome-sized particle (plus or minus a few nanometers) would be picked, which is why the number of hits is so large. So unless the noncanonical nucleosomes quadrupled in size or lost most of their histones, they should be grouped with one or more of the other 99 class averages (WT cells) or any of the 100 class averages (cells with GFP-tagged histones). As to whether the later reference particle could have prevented us from detecting one of the non-canonical nucleosome structures, we are unable to tell because we’d really have to know what an in situ non-canonical nucleosome looks like first.

      Could it simply be that the yeast nucleoplasm is differently structured than that of HeLa cells and it was harder to find nucleosomes by template matching in these cells? The authors argue against crowding in the discussion, but maybe it is just a nucleoplasm texture that side-tracks the programs?

      Presumably, the nucleoplasmic “side-tracking” texture would come from some molecules in the yeast nucleus. These molecules would be too small to visualize as discrete particles in the tomographic slices, but they would contribute textures that can be “seen” by the programs – in particular RELION, which does the discrimination between structural states. We do not know the inner-workings of RELION well enough to say what kinds of density textures would side-track its classification routines.

      The title of the paper is not well reflected in the main figures. The title of Figure 2 says "Canonical nucleosomes are rare in wild-type cells", but that is not shown/quantified in that figure. Rare in comparison to what? I suggest adding a comparative view from the HeLa cells, like the text does in lines 195-199. A measure of nucleosomes detected per volume of nucleoplasm would also facilitate a comparison.

      Figure 2’s title is indeed unclear and does not align with the paper’s title and key conclusion. The rarity here is relative to the expected number of nucleosomes (canonical plus non-canonical). We have changed the title to “Canonical nucleosomes are a minority of the expected total in wild-type cells”. We would prefer to leave the reference to HeLa cells to the main text instead of as a figure panel because the comparison is not straightforward for a graphical presentation. Instead, we will report the total number of nucleosomes estimated for this particular tomogram (~7,600) versus the number of canonical nucleosomes classified (297; 594 if we assume we missed half of them).

      If the cell contains mostly non-canonical nucleosomes, are they really non-canonical? Maybe a change of language is required once this is somewhat sure (say, after line 303).

      This is an interesting semantic and philosophical point. From the yeast cell’s “perspective”, the canonical nucleosome structure would be the form that is in the majority. That being said, we do not know if there is one structure that is the majority. From the chromatin field’s point of view, the canonical nucleosome is the form that is most commonly seen in all the historical – and most contemporary – literature, namely something that resembles the crystal structure of Luger et al, 1997. Given these two lines of thinking, we will add the following clarification after line 303:

      “At present, we do not know what the non-canonical nucleosome structures are, meaning that we cannot even determine if one non-canonical structure is the majority. Until we know what the family of non-canonical nucleosome structures are, we will use the term non-canonical to describe the nucleosomes that do not have the canonical (crystal) structure”.

      The authors could explain more why they sometimes use the conventional 2D followed by 3D classification approach and sometimes "direct 3-D classification". Why, for example, do they do 2D followed by 3D in Figure S5A? This Figure could be considered a regular figure since it shows the main message of the paper.

      Because the classification of subtomograms in situ is still a work in progress, we felt it would be better to show one instance of 2-D classification for lysates and one for lamellae. While it is true that we could have presented direct 3-D classification for the entire paper, we anticipate that readers will be interested to see what the in situ 2-D class averages look like.

      The main message is that there are canonical nucleosomes in situ (at least in wild-type cells), but they are a minority. Therefore, the conventional classification for Figure S5A should not be a main figure because it does not show any canonical nucleosome class averages in situ.

      Figure 1: Why is there a gap in the middle of the nucleosome in panel B? The authors write that this is a higher resolution structure (18Å), but in the even higher resolution crystallography structure (3Å resolution), there is no gap in the middle.

      There is a lower concentration of amino acids at the middle in the disc view; unfortunately, the space-filling model in Figure 1A hides this feature. The gap exists in experimental cryo-EM density maps. See below for an example. The size of the gap depends on the contour level and probably the contrast mechanism, as the gap is less visible in the VPP subtomogram averages. To clarify this confusing phenomenon, we will add the following lines to the figure legend:

      “The gap in the disc view of the nuclear-lysate-based average is due to the lower concentration of amino acids there, which is not visible in panel A due to space-filling rendering. This gap’s size may depend on the contrast mechanism because it is not visible in the VPP averages.”

      Reviewer #2 (Public Review):

      Nucleosome structures inside cells remain unclear. Tan et al. tackled this problem using cryo-ET and 3-D classification analysis of yeast cells. The authors found that the fraction of canonical nucleosomes in the cell could be less than 10% of total nucleosomes. The finding is consistent with the unstable property of yeast nucleosomes and the high proportion of the actively transcribed yeast genome. The authors made an important point in understanding chromatin structure in situ. Overall, the paper is well-written and informative to the chromatin/chromosome field.

      We thank Reviewer 2 for their positive assessment.

      Reviewer #3 (Public Review):

      Several labs in the 1970s published fundamental work revealing that almost all eukaryotes organize their DNA into repeating units called nucleosomes, which form the chromatin fiber. Decades of elegant biochemical and structural work indicated a primarily octameric organization of the nucleosome with 2 copies of each histone H2A, H2B, H3 and H4, wrapping 147bp of DNA in a left-handed toroid, to which linker histone would bind.

      This was true for most species studied (except that yeast lack linker histone) and was recapitulated in stunning detail by in vitro reconstitutions using salt dialysis or chaperone-mediated assembly of nucleosomes. Thus, these landmark studies set the stage for an exploding number of papers on chromatin over the past 45 years.

      An emerging counterpoint to the prevailing idea of static particles is that nucleosomes are much more dynamic and can undergo spontaneous transformation. Such dynamics could arise from intrinsic instability due to DNA structural deformation, specific histone variants or their mutations, post-translational histone modifications which weaken the main contacts, protein partners, and predominantly, from active processes like ATP-dependent chromatin remodeling, transcription, repair and replication.

      This paper is important because it tests this idea at whole scale, applying novel cryo-EM tomography tools to examine the state of chromatin in yeast lysates or cryo-sections. The experimental work is meticulously performed, with a vast amount of data collected. The authors interpret the main findings to suggest that the majority of yeast nucleosomes lack a stable octameric conformation. What is surprising is not that alternative conformations of nucleosomes might exist in vivo, but rather the sheer scale of such particles reported, relative to the traditional form expected from decades of biochemical, biophysical and structural data. Thus, it is likely that this work will be perceived as controversial. Nonetheless, we believe these kinds of tools represent an important advance for in situ analysis of chromatin. We also think the field should have the opportunity to carefully evaluate the data and assess whether the claims are supported, or consider what additional experiments could be done to further test the conceptual claims made. It is our hope that such work will spark thought-provoking debate in a collegial fashion, and lead to the development of exciting new tools that can interrogate native chromatin shape in vivo. Most importantly, it will be critical to assess the biological implications of more dynamic (or static) forms of nucleosomes, the associated chromatin fiber, and its three-dimensional organization for nuclear or mitotic function.

      Thank you for putting our work in the context of the field’s trajectory. We hope our EMPIAR entry, which includes all the raw data used in this paper, will be useful for the community. As more labs (hopefully) upload their raw data and as image-processing continues to advance, the field will be able to revisit the question of non-canonical nucleosomes in budding yeast and other organisms.

    1. Author Response:

      Reviewer #1:

      In this paper, Wammes et al. used fMRI to investigate changes in representational similarity of temporally paired images in hippocampal subfields. The stimuli were designed to parametrically vary in their visual similarity so that individual pairs covered the entire range of visual overlap, which was behaviourally validated by a separate sample of participants. The authors compared the neural patterns evoked by these pairs of stimuli before and after participants completed a statistical learning task. The findings showed that pre- to post-learning, representations in the dentate gyrus reconfigured to fit a cubic model, consistent with the non-monotonic plasticity hypothesis (NMPH).

      This is an interesting, novel approach with a clever stimulus manipulation which addresses a gap in the current literature. The study is well-motivated by theory, the analyses are appropriate and clearly described, the implemented controls are carefully designed, and the manuscript is well-written. However, it is unclear whether the same principles necessarily generalize beyond visual similarity, and whether these neural patterns meaningfully relate to behaviour.

      1) The analytic approach is well-designed and the results clearly address the hypotheses. However, it seems like the conclusions might be dependent on this learning paradigm, which should be discussed in a bit more detail and made clearer. The present statistical learning approach is somewhat implicit in its nature and relies on the participants gradually recognizing the temporal links between stimuli. In contrast, in most prior studies cited in the present manuscript, participants were explicitly instructed to make associations between stimuli that either occurred on the screen simultaneously, or relatively far apart in time (i.e., not successively). This top-down influence likely plays an important role. Even beyond experimental paradigms - we often make connections between similar experiences that occurred far apart in time, and cannot always rely on temporal contingencies. The step between previous work and statistical learning needs to be made clearer and more explicit.

      Although our current approach involves a more implicit statistical learning task, the hypothesized non-monotonic plasticity is a general mechanism that has been and can be applied across tasks. We used temporal contingency to create a situation where representations were concurrently active. However, prior work has used other manipulations, such as linking to a shared associate. We have modified and expanded both the Introduction and Conclusion to emphasize this broader context and highlight directions for future work.

      See Introduction (p. 4, lines 60-74): “The NMPH has been put forward as a learning mechanism that applies broadly across tasks in which memories compete, whether they have been linked based on incidental co-occurrence in time or through more intentional associative learning (Ritvo et al., 2019). The NMPH can explain findings of differentiation in diverse paradigms (e.g., linking to a shared associate: Chanales et al., 2017; Favila et al., 2016; Schlichting et al., 2015; Molitor et al., 2020; retrieval practice: Hulbert & Norman, 2015; statistical learning: Kim, Norman, & Turk-Browne, 2017) by positing that these paradigms induced moderate coactivation of competing memories. Likewise, relying on the same parameter of coactivation, the NMPH can explain seemingly contradictory findings showing that shared associates (Collin et al., 2015; Milivojevic et al., 2015; Schlichting et al., 2015; Molitor et al., 2020) and co-occurring items (Schapiro et al., 2012; Schapiro, Turk-Browne, Norman, & Botvinick, 2016) can lead to integration by positing that — in these cases — the paradigms induced strong coactivation. Importantly, although the NMPH is compatible with findings of both differentiation and integration across several paradigms with diverse task demands, the explanations above are post hoc and do not provide a principled test of the NMPH’s core claim that there is a continuous, U-shaped function relating the level of coactivation to representational change.”

      See Introduction (p. 5, lines 83-86): “No existing study has demonstrated the full U-shaped pattern for representational change; that is what we set out to do here, using a visual statistical learning paradigm — specifically, we brought about coactivation using temporal co-occurrence between paired items, and we manipulated the degree of coactivation by varying the visual similarity of the items in a pair.”

      See Conclusion (p. 18, lines 370-374): “From a theoretical perspective, these results provide the strongest evidence to date for the NMPH account of hippocampal plasticity. We expect that a similar U-shaped function relating coactivation and representational change will manifest in paradigms with different task demands and stimuli, but additional work is needed to provide empirical support for this claim about generality.”

      2) Related to the point above - the timecourse over which such statistical learning occurs should be discussed. If I understood correctly, all of the learning occurred in the 6 scanned blocks between the two templating runs. Does the NMPH predict that the hippocampal patterns should immediately reconfigure depending on visual input, or only reconfigure once the participants encode the links between paired stimuli? If the pattern consistent with the NMPH is immediately evident, this would suggest that the present findings, while very convincing, might not be governed by the same mechanisms as integration/differentiation in memory. It seems unlikely that participants would immediately attempt to link these complex visual stimuli, especially as the cover task was orthogonal. To this end, it would be helpful to see any kind of analysis evaluating representations across the 6 statistical learning runs.

      The reviewer correctly describes that learning took place over the six blocks between templating runs. We agree that observing the emergence of representational change across those runs would be ideal. Unfortunately, however, our design is not compatible with this analysis. Because the pairs were learned from deterministic transition probabilities, the onsets of the paired stimuli were correlated in time. When these correlated events are convolved with the slow hemodynamic response, the responses to the paired stimuli cannot be reliably distinguished. Also, the response to the second stimulus in a pair would be affected by visual similarity to its preceding stimulus as a result of adaptation/repetition suppression, confounding comparisons across conditions. These problems are precisely why we employed a pre/post design in which to-be/previously paired stimuli are presented independently in a random order. This allows for the assessment of representational similarity unconfounded with correlated onsets or adaptation.

      Although we cannot provide a sense of the learning trajectory, we now explain this design decision, acknowledge the limitation, and highlight this as an opportunity for future work with other, more time-resolved modalities or with (random) representational assessments interdigitated with the learning blocks.

      See Discussion (p. 17, lines 358-366): “Finally, although analyzing representational overlap in templating runs before and after statistical learning afforded us the ability to quantify pre-to-post changes, our design precluded analysis of the emergence of representational change over time. That is, we could not establish whether integration or differentiation occurred early or late in statistical learning. This is because, during statistical learning runs, the onsets of paired images were almost perfectly correlated, meaning that it was not possible to distinguish the representation of one image from its pairmate. Future work could monitor the time course of representational change, either by interleaving additional templating runs throughout statistical learning (although this could interfere with the statistical learning process), or by exploiting methods with higher temporal resolution where the responses to stimuli presented close in time can more readily be disentangled.”
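      The collinearity argument above (deterministically paired onsets convolved with a slow hemodynamic response) can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' analysis: it assumes a canonical-style double-gamma HRF, 1-s sampling, a hypothetical 1-s lag between a paired item A and its successor B, and randomly placed onsets, none of which are taken from the study. The helper names (double_gamma_hrf, convolve_onsets, simulate) are hypothetical. It compares the correlation between the regressors for a deterministic pair (A, B) with that for an unrelated item (C). Adaptation/repetition suppression, the second confound mentioned in the response, is not modeled.

```python
import math
import random


def double_gamma_hrf(t):
    """Assumed canonical-style double-gamma HRF (peak near 5 s, undershoot near 15 s)."""
    if t <= 0:
        return 0.0
    peak = t ** 5 * math.exp(-t) / math.gamma(6)
    undershoot = t ** 15 * math.exp(-t) / math.gamma(16)
    return peak - undershoot / 6.0


def convolve_onsets(onsets, n_samples, hrf_len=32):
    """Convolve a stick function of event onsets (1-s samples) with the HRF kernel."""
    kernel = [double_gamma_hrf(float(k)) for k in range(hrf_len)]
    signal = [0.0] * n_samples
    for onset in onsets:
        for k, weight in enumerate(kernel):
            if onset + k < n_samples:
                signal[onset + k] += weight
    return signal


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def simulate(lag=1, n_events=30, n_samples=600, seed=0):
    """Return (r for paired A/B regressors, r for unpaired A/C regressors)."""
    rng = random.Random(seed)
    a_onsets = rng.sample(range(0, n_samples - 40), n_events)
    b_onsets = [t + lag for t in a_onsets]  # B always follows A (deterministic pair)
    c_onsets = rng.sample(range(0, n_samples - 40), n_events)  # unrelated item
    a = convolve_onsets(a_onsets, n_samples)
    b = convolve_onsets(b_onsets, n_samples)
    c = convolve_onsets(c_onsets, n_samples)
    return pearson(a, b), pearson(a, c)


corr_paired, corr_unpaired = simulate()
print(f"paired A/B regressors:   r = {corr_paired:.2f}")
print(f"unpaired A/C regressors: r = {corr_unpaired:.2f}")
```

      Under these assumed settings the paired regressors come out highly correlated while the unpaired ones do not, which is why responses to paired items are hard to separate during the learning runs and why randomized pre/post templating runs sidestep the problem.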

      3) In the Introduction and Discussion, the authors focus on learning and discuss the integration/differentiation of memories. To establish a link between the reported hippocampal representations and behaviour, it would be helpful to show evidence of a link between neural differentiation and measures of statistical learning such as priming.

      As the reviewer alluded to earlier, our behavioral task is orthogonal to the manipulation of temporal co-occurrence. Accordingly, we do not have any behavioral data on which we could conduct such an analysis. We fully acknowledge the value of this suggestion and now describe this as a limitation and area for future research.

      See Discussion (p. 17, lines 350-357): “Prior work in this area has demonstrated brain-behavior relationships (Favila et al., 2016; Molitor et al., 2020), so it is clear that changes in representational overlap (i.e., either integration or differentiation) can bear on later behavioral performance. However, in the current work, our behavioral task was intentionally orthogonal to the dimensions of interest (i.e., unrelated to temporal co-occurrence and visual similarity), limiting our ability to draw conclusions about potential downstream effects on behavior. We believe that this presents a compelling target for follow-up research. Establishing a behavioral signature of both integration and differentiation in the context of nonmonotonic plasticity will not only clarify the brain-behavior relationship, but also allow for investigations in this domain without requiring brain data.”

      4) From the authors' predictions (and Fig 1), it might follow that participants who show steeper slopes in early visual regions (i.e., higher correspondence to stimulus similarity) pre-learning might also show a stronger cubic trend in the hippocampus. It would be useful to show within-participant analyses to link visual processing regions to hippocampal representations.

      What a fantastic suggestion! To test this prediction, we extracted the linear coefficients in the visual similarity analysis from cortical ROIs (V1, V2, LO, IT, FG, PHC, PRC, and EC) and the cubic model fit in the representational change analysis from the key hippocampal ROI (DG). Linearity during the initial templating run in PRC was associated with stronger non-monotonicity in DG. The full reporting of these analyses is now included in the figure supplements and referenced in the main text.

      See Results, subsection Representational Change (p. 12, lines 228-229): “Interestingly, in an exploratory analysis, we found that the degree of model fit in DG was predicted by the extent to which visual representations in PRC tracked model similarity (see Figure 4—figure supplement 2).”

      Reviewer #2:

      The authors apply neural network modeling and representational analysis of fMRI data to testing the ability of the theoretical framework under the "non-monotonic-plasticity hypothesis" to explain how hippocampal subdivisions represent similarity and distinctiveness between events. They suggest that the dentate gyrus subfield, in particular, was sensitive to the degree of overlap between experiences, and changes how it favored distinctiveness or similarity in its representation of associated stimuli in a non-monotonic manner.

      Overall, the work builds logically on prior evidence from this group focused on how cortical representations influence memory, and leverages a compelling theoretical framework to reconcile some conflict in the literature on how hippocampal representations respond to overlap.

      The primary confusion and concern with the current manuscript was on the theoretical side. It was not wholly clear from the literature review why DG was the predicted locus of the non-monotonic representational relationship observed, and how the findings fit with extant data from rodent work.

      Thank you for providing an opportunity to better motivate our work. We have updated the paragraph justifying our focus on the hippocampus and on DG in particular.

      See Introduction (p. 8, lines 122-147): “We and others have previously hypothesized that nonmonotonic plasticity applies widely throughout the brain (Ritvo et al., 2019), including sensory regions (e.g., Bear, 2003). In this study, we focused on the hippocampus because of its well-established role in supporting learning effects over relatively short timescales (e.g., Favila et al., 2016; Kim et al., 2017; Schapiro et al., 2012). Importantly, we hypothesized that, even if nonmonotonic plasticity occurs throughout the entire hippocampus, it might be easier to trace out the full predicted U-shape in some hippocampal subfields than in others. As discussed above, our hypothesis is that representational change is determined by the level of coactivation — detecting the U-shape requires sweeping across the full range of coactivation values, and it is particularly important to sample from the low-to-moderate range of coactivation values associated with the differentiation ‘dip’ in the U-shaped curve (i.e., the leftmost side of the inset in Fig. 1). Prior work has shown that there is extensive variation in overall activity (sparsity) levels across hippocampal subfields, with CA2/3 and DG showing much sparser codes than CA1 (Barnes, McNaughton, Mizumori, Leonard, & Lin, 1990; Duncan & Schlichting, 2018). We hypothesized that regions with sparser levels of overall activity (DG, CA2/3) would show lower overall levels of coactivation and thus do a better job of sampling this differentiation dip, leading to a more robust estimate of the U-shape, compared to regions like CA1 that are less sparse and thus should show higher levels of coactivation (Ritvo et al., 2019). Consistent with this idea, human fMRI studies have found that CA1 is relatively biased toward integration and CA2/3/DG are relatively biased toward differentiation (Dimsdale-Zucker et al., 2018; Kim et al., 2017; Molitor et al., 2020). 
Zooming in on the regions that have shown differentiation in human fMRI (CA2/3/DG), we hypothesized that the U-shape would be most visible in DG, for two reasons: First, DG shows sparser activity than CA3 (Barnes et al., 1990; GoodSmith et al., 2017; West, Slomianka, & Gundersen, 1991) and thus will do a better job of sampling the left side of the coactivation curve. Second, CA3 is known to show strong attractor dynamics (‘pattern completion’; McNaughton & Morris, 1987; Rolls & Treves, 1998; Guzowski, Knierim, & Moser, 2004) that might make it difficult to observe moderate levels of coactivation. For example, rodent studies have demonstrated that, rather than coactivating representations of different locations, CA3 patterns tend to sharply flip between one pattern and the other (e.g., Leutgeb, Leutgeb, Moser, & Moser, 2007; Vazdarjanova & Guzowski, 2004).”

      Additionally, the theoretical model (nicely illustrated in the manuscript) is considered at a somewhat biological-network-agnostic level. Some assumptions about how context changes over time, how prior representations are maintained over time, etc., are important for non-monotonic relationships between representations and memory to manifest in the model, but the manuscript does not provide much discussion of their plausibility. This was particularly notable given the emphasis placed in the fMRI data on different hippocampal subfields, with little discussion of whether or why the model framework is static across subfields (in terms of how context and item information are represented and connected).

      We appreciate this nudge to discuss these additional subfield-specific factors; we have added a paragraph to the Discussion that addresses these issues.

      See Discussion (p. 16, lines 318-336): “Although we focused above on differences in sparsity when motivating our predictions about subfield-specific learning effects, there are numerous other factors besides sparsity that could affect coactivation and (through this) modulate learning. For example, the degree of coactivation during statistical learning will be affected by the amount of residual activity of the A item during the B item’s presentation in the statistical learning phase. In Figure 1, this residual activity is driven by sustained firing in cortex, but this could also be driven by sustained firing in hippocampus; subfields might differ in the degree to which activation of stimulus information is sustained over time (see, e.g., the literature on hippocampal time cells: Eichenbaum, 2014; Howard & Eichenbaum, 2013), and activation could be influenced by differences in the strength of attractor dynamics within subfields (e.g., Neunuebel & Knierim, 2014; Leutgeb et al., 2007). Also, in Figure 1, the learning responsible for differentiation was shown as happening between ‘perceptual conjunction’ neurons and ‘context’ neurons in the hippocampus. Subfields may vary in how strongly these item and context features are represented, in the stability/drift of the context representations (DuBrow, Rouhani, Niv, & Norman, 2017), and in the interconnectivity between item and context features (Witter, Wouterlood, Naber, & Van Haeften, 2000). It is also likely that some of the relevant plasticity between item and context features happens across, in addition to within, subfields (Hasselmo & Eichenbaum, 2005). 
For these reasons, exploring the predictions of the NMPH in the context of biologically detailed computational models of the hippocampus (e.g., Schapiro, Turk-Browne, Botvinick, & Norman, 2017; Frank, Montemurro, & Montaldi, 2020; Hasselmo & Wyble, 1997) will help to sharpen predictions about what kinds of learning should occur in different parts of the hippocampus.”

      As such, this review was very positive, and found the methods to be sound and the conclusions to be solid. There was some room for improvement in how the theoretical foundation was presented for the hippocampal subregion fMRI predictions and for the conceptualization of the neural network memory model.

      We agree with the reviewer that more justification of our specific hippocampal predictions was required and we are grateful for their suggestions.

    1. Author Response

      Reviewer #1 (Public Review):

      This study explores the mechanisms responsible for reduced steroidogenesis of adrenocortical cells in a mouse model of systemic inflammation induced by LPS administration. Working from RNA and protein profiling data sets in adrenocortical tissue from LPS-treated mice they report that LPS perturbs the TCA cycle at the level of succinate dehydrogenase B (SDHB) impairing oxidative phosphorylation. Additional studies indicate these events are coupled to increased IL-1β levels which inhibit SDHB expression through DNA methyltransferase-dependent DNA methylation of the SDHB promoter.

      In general, these are interesting studies with some novel implications. I do, however, have concerns with some of the authors' rather broad conclusions given the limitations of their experimental approach. The paper could be improved by addressing the following points:

      1) The limitations of using LPS as the model for systemic inflammation need to be explicitly described.

      We thank the Reviewer for this suggestion. Indeed, the LPS model has several limitations as a preclinical model of sepsis, which are outlined in the revised Discussion. Despite its limitations, we chose this model over other models of sepsis, such as the cecal slurry model, because of its high reproducibility, which enabled the mechanistic studies presented here.

      2) The initial in vivo findings, which support the proposed metabolic perturbation, are based on descriptive profiling data obtained at one time point following a single dose of LPS. The authors' conclusion regarding the ultimate transcriptional pathway identified hinges critically on knowledge of the time course of this effect following LPS, which is not adequately addressed in the paper. How were this time point and dose of LPS established, and are there data from different doses and time points?

      We thank the Reviewer for raising this question, which we indeed addressed at the beginning of our studies in order to determine a suitable time point and dose of LPS treatment. We chose 6 h as a suitable starting time point for transcriptional analyses, based on the fact that LPS triggers transcriptional changes in the adrenal gland and other tissues within a few hours (1-3). Confirming our expectations, we found 2,609 differentially expressed genes (Figure 1a) in the adrenal cortex of LPS-treated mice, many of which were involved in cellular metabolism (Figure 1d,e, 2a-e, Table 1, Table 2). Acute transcriptional changes, which are more likely to reflect direct effects of inflammatory signals than changes occurring at later time points (for instance, over days), would allow us to mechanistically investigate the effects of inflammation in the adrenal gland, which was the purpose of our studies. Hence, we were guided by the transcriptional changes observed at 6 h of LPS treatment and formulated the hypothesis that disruption of the TCA cycle in adrenocortical cells is key to the impact of inflammation on adrenal function. Along this line, we analyzed the metabolomic profile of the adrenal gland at 6 and 24 h of LPS treatment. At 6 h, succinate levels as well as the succinate / fumarate ratio remained unchanged (Author response image 1A), while at 24 h post-injection both were increased by LPS (Author response image 1B, Figure 2l,o,q). The delayed increase in succinate levels (observed at 24 h) following downregulation of Sdhb mRNA expression (at 6 h) can be explained by the time required for SDHB protein levels to decline, which depends on protein turnover, estimated at approximately 12 h in HeLa cells (4). Based on these findings, all further metabolomic analyses were performed at 24 h of LPS treatment.

      Author response image 1. LPS increases the succinate/fumarate ratio at 24 but not 6h. Mice were i.p. injected with 1 mg/kg LPS and 6 h (A) and 24 h (B) post-injection succinate and fumarate levels were determined by LC-MS/MS in the adrenal gland. n=8-10; data are presented as mean ± s.e.m. Statistical analysis was done with two-tailed Mann-Whitney test. *p < 0.05.

      Having established the most suitable time points of LPS treatment to observe induced transcriptional and metabolic changes, we set out to define the LPS dose to be used in subsequent experiments. The data shown in Author response image 1 were acquired after treatment with 1 mg/kg LPS, a dose that was previously reported to cause transcriptional reprofiling of the adrenal gland (1, 2). However, 5 mg/kg LPS, similarly to 1 mg/kg LPS, also reduced Sdhb, Idh1 and Idh2 expression at 4 h (Author response image 2A) and increased succinate and isocitrate levels at 24 h (Author response image 2B) in the adrenal gland. Given that the effects of 1 and 5 mg/kg LPS were similar, for animal welfare reasons we continued our studies with the lower dose.

      Author response image 2. Five mg/kg LPS downregulates Sdhb, Idh1 and Idh2 expression and increases succinate and isocitrate levels in the adrenal gland of mice. Sdhb, Idh1 and Idh2 expression (A) and succinate and isocitrate levels (B) were assessed in the adrenal gland of mice treated with 5 mg/kg LPS for 4 h (A) and 24 h (B). n=5; data are presented as mean ± s.d. Statistical analysis was done with two-tailed Mann-Whitney test. *p < 0.05, **p < 0.01.

      3) Related to the point above, the authors' data supporting a break in the TCA cycle would be strengthened by direct biochemical assessment (metabolic flux analysis) of the affected step in the TCA cycle.

      We entirely agree with the Reviewer and considered performing TCA cycle metabolic flux analyses in adrenocortical cells. Unfortunately, the low yield of adrenocortical cells per mouse (approx. 3,000-6,000) does not allow metabolic flux experiments, which require higher cell numbers per sample, several time points per condition and an adequate number of replicates per experiment. Moreover, NCI-H295R cells, being adrenocortical carcinoma cells, are expected to have substantially altered metabolic fluxes compared to normal cells. Since we would not be able to confirm findings from metabolic flux experiments in NCI-H295R cells in primary adrenocortical cells, as we did for the rest of the experiments, we decided not to perform metabolic flux experiments in NCI-H295R cells. However, metabolic flux analysis of adrenocortical cells under inflammatory or other stress conditions remains an important future task that we will pursue once a more suitable cell culture system is established.

      4) The proposed connection of DNMT and IL1 signaling to systemic inflammation and reduced steroidogenesis could be more firmly established by additional studies in adrenal cortical cells lacking these genes.

      We thank the Reviewer for this excellent suggestion. In the revised manuscript we strengthened the evidence for an IL-1β-DNMT1 link and show that DNMT1 deficiency blocks the effects of IL-1β on SDHB promoter methylation (Figure 6k), the succinate / fumarate ratio (Figure 6m), the oxygen consumption rate (Figure 6n) and steroidogenesis (Figure 6o-q) in adrenocortical cells. In order to validate the role of IL-1β in vivo, mice were simultaneously treated with LPS and Raleukin, an IL-1R antagonist. Treatment with Raleukin increased SDH activity (Figure 6r), reduced succinate levels and the succinate / fumarate ratio (Figure 6s,t) and increased corticosterone production in LPS-treated mice (Figure 6u).

      Reviewer #2 (Public Review):

      The present manuscript provides a mechanistic explanation for an event in adrenal endocrinology: the adrenal resistance that develops during excessive inflammation, as opposed to acute inflammation. The authors identify disturbances in adrenal mitochondrial function that distinguish excessive inflammation. During severe inflammation, the TCA cycle in the adrenal gland is disrupted at the level of succinate oxidation, leading to an accumulation of succinate in the adrenal cortex. The authors also provide a mechanistic explanation for the accumulation of succinate: they demonstrate that IL-1β decreases expression of SDH, the enzyme that oxidizes succinate, through methylation of the SDH promoter. This work presents a solid explanation for an important phenomenon. Below are a few questions that should be resolved experimentally.

      1) The authors should confirm through direct biochemical assays of enzymatic activity that steroidogenesis enzyme activity is not impaired. Many of these enzymes are located in the mitochondria and their activity may be diminished due to the disturbed, high succinate environment of the cortical cell as opposed to the low ATP production.

      We thank the Reviewer for this question. The activity of the first and rate-limiting steroidogenic enzyme, cytochrome P450 side-chain cleavage (SCC, CYP11A1), which generates pregnenolone from cholesterol, was recently shown to require intact SDH function (5). In agreement with this report, we show that production of progesterone, the direct derivative of pregnenolone, is impaired upon SDH inhibition (Figure 5b,e,h). In addition, we assessed the activity of CYP11B1 (steroid 11β-hydroxylase), the enzyme catalyzing the conversion of 11-deoxycorticosterone to corticosterone, i.e. the last step of glucocorticoid synthesis, by measuring corticosterone and 11-deoxycorticosterone levels by LC-MS/MS and calculating the ratio of corticosterone to 11-deoxycorticosterone in ACTH-stimulated adrenocortical cells and explants. The corticosterone / 11-deoxycorticosterone ratio was not affected by Sdhb silencing in adrenocortical cells (Figure 5-Supplement 2g), nor did it change upon LPS treatment in adrenal explants (Figure 5-Supplement 2h), suggesting that CYP11B1 activity may not be altered upon SDH blockage. Hence, we propose that, upon inflammation, impairment of SDH function may disrupt at least the first steps of steroidogenesis (producing pregnenolone/progesterone), thereby diminishing production of all downstream adrenocortical steroids. This is now discussed in the revised manuscript.

      2) What is the effect of high ROS production? Is steroidogenesis resolved if ROS is pharmacologically decreased even if the reduction of ATP is not resolved?

      We thank the Reviewer for this suggestion, which helped us to broaden our findings. Indeed, ROS scavenging by the vitamin E analog Trolox (Figure 5n) partially reversed the inhibitory effect of DMM on steroidogenesis (Figure 5o,p), suggesting that impairment of SDH function impacts steroidogenesis also via enhanced ROS production (Figure 4g).

      3) Does increased intracellular succinate (through cell permeable succinate treatment) inhibit steroidogenesis even if there is not a blockage of OXPHOS?

      We suggest that SDH inhibition and succinate accumulation lead to reduced steroidogenesis due to impaired oxidative phosphorylation (Figure 4c,e, 5i), reduced ATP synthesis (Figure 4d, 5j-m) and increased ROS production (Figure 4g, 5o,p). Since SDH is itself complex II of the electron transport chain, its function cannot be decoupled from oxidative phosphorylation, which limits the experimental means for addressing this question.

      4) It should be demonstrated that genetic loss of IL1 signaling in adrenal cortical cells results in a loss of the effect of LPS on reduced steroidogenesis and increased succinate accumulation.

      We thank the Reviewer for this suggestion. Developing a mouse line with genetic loss of Il-1r in adrenocortical cells was not feasible within the short time available for revisions. Instead, LPS-treated mice were additionally treated with the IL-1R antagonist Raleukin to study the in vivo effects of IL-1β in the adrenal gland. IL-1R antagonism increased SDH activity in the adrenal cortex (Figure 6r), decreased succinate levels and the succinate/fumarate ratio in the adrenal gland (Figure 6s,t) and enhanced corticosterone production (Figure 6u) in LPS-treated mice, supporting our hypothesis that IL-1β mediates the effects of systemic inflammation in the adrenal cortex.

      5) It should be demonstrated that genetic loss of IL1 signaling in adrenal cortical cells results in a loss of the effect of LPS on SDH activity, ATP production and SDH promoter methylation.

      As outlined above, Raleukin treatment increased SDH activity in the adrenal cortex (Figure 6r) and decreased succinate levels and the succinate/fumarate ratio in the adrenal gland (Figure 6s,t) of mice treated with LPS. Furthermore, IL-1β reduced the ATP/ADP ratio (Figure 6e) and enhanced SDHB promoter methylation in NCI-H295R cells (Figure 6k).

      6) It should be shown that the silencing of DNMT eliminates or diminishes the effect of LPS on reduced steroidogenesis and increased succinate accumulation.

      We thank the Reviewer for this suggestion, which prompted us to strengthen the evidence for the involvement of DNMT1 in the effects of LPS on adrenocortical cell metabolism and function. As mentioned above, developing a new mouse line, in this case bearing genetic loss of DNMT1 in adrenocortical cells, was not feasible within the short time available for revisions. Therefore, we assessed the role of DNMT1 by silencing it via siRNA transfection in primary adrenocortical cells and NCI-H295R cells. We show that DNMT1 silencing inhibits the effect of IL-1β on SDHB promoter methylation (Figure 6k), restores Sdhb expression (Figure 6l) and reduces the succinate/fumarate ratio in IL-1β-treated adrenocortical cells (Figure 6m). Accordingly, DNMT1 silencing restores ACTH-induced production of corticosterone, 11-deoxycorticosterone and progesterone in IL-1β-treated adrenocortical cells (Figure 6o-q). We chose to stimulate adrenocortical cells with IL-1β instead of LPS because in vitro the effects of IL-1β were more robust than those of LPS (possibly due to reduced TLR4 expression or function in cultured adrenocortical cells), and in order to show the link between IL-1β and DNMT1.

      7) Does silencing of DNMT reduce OXPHOS in adrenal cortical cells?

      We measured the oxygen consumption rate in NCI-H295R cells, which were transfected with siRNA against DNMT1 and treated or not with IL-1β. IL-1β reduced the OCR in cells transfected with control siRNA, while DNMT1 silencing blunted the effect of IL-1β (Figure 6n).

      8) The effects of LPS on reduced adrenal steroidogenesis are not elaborated at the physiological level. The manuscript should demonstrate the ramifications of the adrenal function decreasing after LPS. Does CORT release become less pronounced after subsequent challenges? Does baseline CORT decrease at some point? No physiological consequences are shown. Similarly, these physiological consequences of decreased adrenal function should be dependent on decreased SDH activity and OXPHOS in adrenal cells and this should be demonstrated experimentally.

      We thank the Reviewer for raising this excellent question. Inflammation is a potent inducer of the hypothalamus-pituitary-adrenal (HPA) axis, causing increased glucocorticoid production, a stress response that leads to vital immune and metabolic adaptations. Accordingly, LPS treatment rapidly increases glucocorticoid production in mice (1, 6, 7). Reduced adrenal responsiveness to ACTH is associated with decreased survival of septic mice (8). These preclinical findings are in accordance with observations in septic patients, in whom impaired adrenal function correlates with a high risk of death (9). Along this line, the ACTH test has been suggested to have prognostic value for identifying septic patients with high mortality risk (9, 10).

      In order to confirm impairment of the adrenal gland function in septic mice, animals were subjected to sepsis via administration of a high LPS dose (10 mg / kg) and treated with ACTH 24 h later. Indeed, the ACTH-induced increase in corticosterone levels was diminished in LPS-treated mice (Author response image 3). This finding was further confirmed in adrenal explants, in which LPS pre-treatment also blunted ACTH-stimulated corticosterone production (Figure 5s).

      Author response image 3. High LPS dose blunts the ACTH response in mice. C57BL/6J mice were i.p. injected with 10 mg/kg LPS or PBS and 24 h later they were i.p. injected with 1 mg/kg ACTH. One hour after ACTH administration, blood was collected retro-orbitally and plasma corticosterone levels were determined by LC-MS/MS. n=4-5; data are presented as mean ± s.d. Statistical analysis was done with a two-tailed Mann-Whitney test. *p < 0.05.
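      With only 4-5 mice per group, a two-tailed Mann-Whitney test has very few possible outcomes. The sketch below shows how an exact p-value arises at these group sizes. It assumes an exact permutation computation, although the software and method the authors used are not stated. The function names and the example values are hypothetical and are not the study's data. With 4 and 5 animals there are C(9,4) = 126 ways to split the ranks, so even complete separation of the two groups gives p = 2/126 ≈ 0.016.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared_rank
        start = end + 1
    return ranks


def mann_whitney_exact(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Exact two-tailed Mann-Whitney U test (hypothetical helper).

    Enumerates every split of the pooled ranks into groups of size len(x) and len(y).
    Returns (U for group x, two-tailed p-value). Only practical for small groups.
    """
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))

    def u_for(indices) -> float:
        # U = rank sum of the group minus its minimum possible rank sum.
        return sum(ranks[i] for i in indices) - n1 * (n1 + 1) / 2

    expected_u = n1 * n2 / 2
    u_observed = u_for(range(n1))
    observed_dev = abs(u_observed - expected_u)
    total = extreme = 0
    for indices in combinations(range(n1 + n2), n1):
        total += 1
        # Two-tailed: count splits at least as far from E[U] as the observed one.
        if abs(u_for(indices) - expected_u) >= observed_dev - 1e-12:
            extreme += 1
    return u_observed, extreme / total


if __name__ == "__main__":
    # Hypothetical, fully separated groups of 4 and 5 values.
    u, p = mann_whitney_exact([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0, 9.0])
    print(u, round(p, 4))  # 0.0 0.0159
```

      A normal-approximation version of the test would report a somewhat different p-value at these group sizes. This is one reason the exact form is often preferred for n = 4-5.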

      Given that the purpose of our study was to dissect the mechanisms underlying adrenal gland dysfunction in inflammation rather than to analyze its physiological consequences, we chose not to pursue these lines of investigation further and instead concentrated on the role of cell metabolism in adrenocortical cells in the context of inflammation.

      References

      1. W. Kanczkowski, A. Chatzigeorgiou, M. Samus, N. Tran, K. Zacharowski, T. Chavakis, S. R. Bornstein, Characterization of the LPS-induced inflammation of the adrenal gland in mice. Mol Cell Endocrinol 371, 228-235 (2013).
      2. L. S. Chen, S. P. Singh, M. Schuster, T. Grinenko, S. R. Bornstein, W. Kanczkowski, RNA-seq analysis of LPS-induced transcriptional changes and its possible implications for the adrenal gland dysregulation during sepsis. J Steroid Biochem Mol Biol 191, 105360 (2019).
      3. V. I. Alexaki, G. Fodelianaki, A. Neuwirth, C. Mund, A. Kourgiantaki, E. Ieronimaki, K. Lyroni, M. Troullinaki, C. Fujii, W. Kanczkowski, A. Ziogas, M. Peitzsch, S. Grossklaus, B. Sonnichsen, A. Gravanis, S. R. Bornstein, I. Charalampopoulos, C. Tsatsanis, T. Chavakis, DHEA inhibits acute microglia-mediated inflammation through activation of the TrkA-Akt1/2-CREB-Jmjd3 pathway. Mol Psychiatry 23, 1410-1420 (2018).
      4. C. Yang, J. C. Matro, K. M. Huntoon, D. Y. Ye, T. T. Huynh, S. M. Fliedner, J. Breza, Z. Zhuang, K. Pacak, Missense mutations in the human SDHB gene increase protein degradation without altering intrinsic enzymatic function. FASEB J 26, 4506-4516 (2012).
      5. H. S. Bose, B. Marshall, D. K. Debnath, E. W. Perry, R. M. Whittal, Electron Transport Chain Complex II Regulates Steroid Metabolism. iScience 23, 101295 (2020).
      6. W. Kanczkowski, V. I. Alexaki, N. Tran, S. Grossklaus, K. Zacharowski, A. Martinez, P. Popovics, N. L. Block, T. Chavakis, A. V. Schally, S. R. Bornstein, Hypothalamo-pituitary and immune-dependent adrenal regulation during systemic inflammation. Proc Natl Acad Sci U S A 110, 14801-14806 (2013).
      7. W. Kanczkowski, A. Chatzigeorgiou, S. Grossklaus, D. Sprott, S. R. Bornstein, T. Chavakis, Role of the endothelial-derived endogenous anti-inflammatory factor Del-1 in inflammation-mediated adrenal gland dysfunction. Endocrinology 154, 1181-1189 (2013).
      8. C. Jennewein, N. Tran, W. Kanczkowski, L. Heerdegen, A. Kantharajah, S. Drose, S. Bornstein, B. Scheller, K. Zacharowski, Mortality of Septic Mice Strongly Correlates With Adrenal Gland Inflammation. Crit Care Med 44, e190-199 (2016).
      9. D. Annane, V. Sebille, G. Troche, J. C. Raphael, P. Gajdos, E. Bellissant, A 3-level prognostic classification in septic shock based on cortisol levels and cortisol response to corticotropin. JAMA 283, 1038-1045 (2000).
      10. E. Boonen, S. R. Bornstein, G. Van den Berghe, New insights into the controversy of adrenal function during critical illness. Lancet Diabetes Endocrinol 3, 805-815 (2015).
      11. C. C. Huang, Y. Kang, The transient cortical zone in the adrenal gland: the mystery of the adrenal X-zone. J Endocrinol 241, R51-R63 (2019).
    1. Author Response:

      Reviewer #1:

      The manuscript by Jasmien Orije and colleagues uses advanced Diffusion Tensor and Fixel-Based brain imaging methods to examine brain plasticity in male and female European starlings. Songbirds provide a unique animal model to interrogate how the brain controls a complex, learned behaviour: song. The authors used DT imaging to identify known structural changes, and uncover new ones, in grey and white matter in male and female brains. The choice of the European starling as a model songbird was smart, as this bird has a larger brain that facilitates anatomical localization, clear sex differences in song behavior and well-characterized photoperiod-induced changes in reproductive state. The authors are commended for using both male and female starlings. The photoperiodic treatment used was optimal to capture the key changes in physiological state. The high sampling frequency makes it possible to monitor key changes in physiology, behaviour and brain anatomy. Two exciting findings were the increased role of the cerebellum and the hippocampal recruitment in female birds engaged in singing behaviour. The development of non-invasive, multi-sampling brain imaging in songbirds is a major advance for studies that seek to understand the mechanisms that control the motivation and production of singing behavior. The methods described herein set the foundation for developing targeted hypotheses on how vocal learning, such as language learning, is processed in discrete brain regions. Overall, the data presented in the study are extensive and include a comprehensive analysis of regulated changes in brain microstructural plasticity in male and female songbirds.

      Reviewer #2:

      Orije et al. employed diffusion weighted imaging to longitudinally monitor the plasticity of the song control system during multiple photoperiods in male and female starlings. The authors found that both sexes experience similar seasonal neuroplasticity in multisensory systems and cerebellum during the photosensitive phase. The authors' findings are convincing and rely on a set of well-designed longitudinal investigations encompassing previously validated imaging methods. The authors' identification of a putative sensitive window during which sensory and motor systems can be seasonally re-shaped in both sexes is an interesting finding that advances our understanding of the neural basis of seasonal structural neuroplasticity in songbirds.

      Overall, this is a strong paper whose major strengths are:

      1) The longitudinal and non-invasive measure of plasticity employed

      2) The use of two complementary MR assays of white matter microplasticity

      3) The careful experimental design

      4) The sound and balanced interpretation of the imaging findings

      I do not have any major criticism but just a few minor suggestions:

      1) Pp 6-7. While the comparative description of canonical DTI with respect to fixel-based analysis is well written and of interest to readers with formal training in MR imaging, I found this entire section (and especially the paragraphs in page 7) too technical and out of context in a manuscript that is otherwise fundamentally about neuroplasticity in song birds. The accessibility of this manuscript to non-MR experts could be improved by moving this paragraph into the methods section, or by including it as supplemental material.

      The main purpose of this section was to introduce and explain the diffusion parameters that are used throughout the rest of the paper. Furthermore, we wanted to familiarize the reader with the concept of the population-based template and the different structures that can be visualized with it. We agree that the technical details might have distracted from this main message. Therefore, we have trimmed the technical details from this section and left a short explanation of the biological relevance of the different diffusion parameters and of the anatomical structures visible on the population template. The technical details that were removed are now part of the Materials and Methods section.

      The section now reads as follows:

      In the current study, we analyzed the DWI scans in two distinct ways: 1) using the common approach of diffusion tensor-derived metrics such as fractional anisotropy (FA); and 2) using a novel method of fiber orientation distribution (FOD)-derived fixel-based analysis. Both techniques infer microstructural information from the diffusion of water molecules, but they are conceptually different (table 1). Conventional DTI analysis extracts several diffusion parameters for each voxel, which are sensitive to various microstructural changes in both grey and white matter, as specified in table 1. Fixel-based analysis, on the other hand, explores both microscopic changes in apparent fiber density (FD) and macroscopic changes in fiber-bundle cross-section (log FC) (table 1). Positive fiber-bundle cross-section values indicate expansion, whereas negative values reflect shrinkage of a fiber bundle relative to the template (Raffelt, Tournier et al. 2017).

      A population-based template created for the fixel-based analysis can be used as a study-based atlas in which many of the avian anatomical structures can be identified (figure 2). We recognize many of the white matter structures, such as the different laminae, the occipito-mesencephalic tract (OM) and the optic tract (TrO), among others. Interestingly, many of the nuclei within the song control system (i.e. HVC, robust nucleus of the arcopallium (RA), lateral magnocellular nucleus of the anterior nidopallium (LMAN), and Area X), the auditory system (i.e. intercollicular nucleus complex, nucleus ovoidalis) and the visual system (i.e. entopallium, nucleus rotundus) are identified by the empty spaces between tracts. The applied fixel-based approach is inherently sensitive to changes in white matter and cannot report on the microstructure within grey matter-like brain nuclei; rather, it sheds light on the fiber tracts surrounding and interconnecting them. As such, it provides an excellent tool to investigate the neuroplasticity of different brain networks and, in the case of the nodular song control system, to focus on changes in the fibers surrounding the song control nuclei, referred to as HVC surr, RA surr and Area X surr.
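      The sign convention for log FC can be illustrated with a minimal sketch. It assumes that FC is the ratio of a subject's fiber-bundle cross-section to the template's, and that the natural logarithm is taken. The text above does not state the exact pipeline commands, and the helper names here are hypothetical. Taking the log makes reciprocal changes symmetric around zero. For example, a 25% larger and a 20% smaller cross-section are reciprocal (1.25 × 0.8 = 1), so their log FC values have equal magnitude and opposite sign.

```python
import math


def log_fc(fc: float) -> float:
    """Natural log of a fixel's fiber-bundle cross-section (FC) relative to the template (assumed definition)."""
    if fc <= 0:
        raise ValueError("FC is a positive ratio")
    return math.log(fc)


def describe_bundle_change(fc: float) -> str:
    """Map the sign of log FC to the interpretation used in the text."""
    value = log_fc(fc)
    if value > 0:
        return "expansion"
    if value < 0:
        return "shrinkage"
    return "no change"


if __name__ == "__main__":
    print(describe_bundle_change(1.25), round(log_fc(1.25), 3))  # expansion 0.223
    print(describe_bundle_change(0.8), round(log_fc(0.8), 3))    # shrinkage -0.223
```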

      2) Similarly, many sections, especially the results, are in my opinion too detailed and analytical. While the employed description has the benefit of being systematic and rigorous, the ensuing narrative tends to be very technical and not easily interpretable by non-experts. I think the manuscript could be substantially shortened (by at least 20%, e.g. by removing overly technical or analytical descriptions of all results and regions affected) without losing its appeal and impact. Instead, it would gain in strength and focus, especially if the new results narrative were aimed more directly at the interesting set of questions the authors define in the introductory sections.

      We rewrote the results section, removing the statistical reporting wherever it was also reported in a figure, to reduce the bulk of this section and make it more readable. We made some of the descriptions of the affected regions more approachable by replacing them with parts of the discussion. In this way, we incorporated some of the explanations of why certain findings are unexpected or relevant, as suggested by reviewer #3. Parts of the text that were originally in the discussion are indicated in purple.

      3) The possible effect of brain size has been elegantly controlled by using a median split approach. Have the authors considered using tensor-based morphometry (i.e. using the 3D RARE scans they acquired) to determine where in the brain the small differences in brain size occur? That could be more informative and sensitive than a whole-brain volume quantification.

      We considered adding tensor-based morphometry, but we feel that log FC calculated with MRtrix3 provides a similar account of where these brain differences are located. Both methods are based on the Jacobians of the warps computed between the individual images and the population template. They differ only in the images they start from (3D RARE images in tensor-based morphometry, diffusion-weighted images for the log FC metric of MRtrix3) and in the fact that MRtrix3 restricts itself to changes in the cross-section of a given fiber tract.

      The log FC difference in figure 4 gives a similar account of the differences in brain size between both sexes. Additionally, figure 6 indicates the log FC differences between small and large brain birds.

      4) I think Fig. 3 and Fig. 4 may benefit from an ROI-based quantification of the parameters of interest across groups (similar to what has been done for Fig. 7 and its related Fig. 8). This could help readers assess the biological relevance of the mapped parameters. For instance, in Fig. 3, most FA differences occur in low-FA (i.e. grey matter-dense?) regions.

      We supplemented figure 3 and figure 4 with the ROI-based parameter values extracted from the significant clusters. In line with this reasoning, we also added the same kind of supplementary figures for figure 5 and figure 6.

      Figure 3 – figure supplement 1: Overview of the fractional anisotropy (FA) changes over time, extracted from the relevant ROI-based clusters with significant sex differences. The grey area indicates the entire photosensitive period of short days (8L:16D). Significant sex differences are reported with their p-value under the respective ROI-based cluster. Time points that do not share a letter differ significantly from each other (post-hoc t-tests, p < 0.05, Tukey's HSD correction for multiple comparisons); if two time points share the same letter, their fractional anisotropy values are not significantly different.

      Figure 4 – figure supplement 2: Overview of the fiber density (FD) changes over time, extracted from the relevant ROI-based clusters with significant sex differences. The grey area indicates the entire photosensitive period of short days (8L:16D). Significant sex differences are reported with their p-value under the respective ROI-based cluster. Time points that do not share a letter differ significantly from each other (post-hoc t-tests, p < 0.05, Tukey's HSD correction for multiple comparisons); if two time points share the same letter, their FD values are not significantly different. Abbreviations: surr, surroundings.

      Figure 4 – figure supplement 3: Overview of the fiber-bundle cross-section (log FC) changes over time, extracted from the relevant ROI-based clusters with significant sex differences. The grey area indicates the entire photosensitive period of short days (8L:16D). Significant sex differences are reported with their p-value under the respective ROI-based cluster. Time points that do not share a letter differ significantly from each other (post-hoc t-tests, p < 0.05, Tukey's HSD correction for multiple comparisons); if two time points share the same letter, their log FC values are not significantly different. Abbreviations: surr, surroundings.

      Figure 5 – figure supplement 1: Overview of the fractional anisotropy (FA) changes over time, extracted from the relevant ROI-based clusters with significant differences in brain size. The grey area indicates the entire photosensitive period of short days (8L:16D). Significant brain size differences are reported with their p-value under the respective ROI-based cluster. Time points that do not share a letter differ significantly from each other (post-hoc t-tests, p < 0.05, Tukey's HSD correction for multiple comparisons); if two time points share the same letter, their fractional anisotropy values are not significantly different. Abbreviations: C, caudal; surr, surroundings.

      Figure 6 – figure supplement 2: Overview of the fiber density (FD) changes over time, extracted from the relevant ROI-based clusters with significant differences in brain size. The grey area indicates the entire photosensitive period of short days (8L:16D). Significant brain size differences are reported with their p-value under the respective ROI-based cluster. Time points that do not share a letter differ significantly from each other (post-hoc t-tests, p < 0.05, Tukey's HSD correction for multiple comparisons); if two time points share the same letter, their FD values are not significantly different. Abbreviations: C, caudal; surr, surroundings.

      Figure 6 – figure supplement 3: Overview of the fiber-bundle cross-section (log FC) changes over time, extracted from the relevant ROI-based clusters with significant differences in brain size. The grey area indicates the entire photosensitive period of short days (8L:16D). Significant brain size differences are reported with their p-value under the respective ROI-based cluster. Time points that do not share a letter differ significantly from each other (post-hoc t-tests, p < 0.05, Tukey's HSD correction for multiple comparisons); if two time points share the same letter, their log FC values are not significantly different. Abbreviations: C, caudal; surr, surroundings.

      5) In Abstract: "We longitudinally monitored the song and neuroplasticity in male.." Perhaps something should be specified after the "the song"? Did the authors mean "the neuroplasticity of song system"?

      No, this is not what we meant; we monitored song behavior and neuroplasticity independently. In our study, we do not limit ourselves to the neuroplasticity of the song system, but instead use a whole-brain approach. The monitoring of song behavior in itself might be useful for other songbird researchers.

      We clarified this in the abstract as follows:

      We longitudinally monitored the song behavior and neuroplasticity in male and female starlings during multiple photoperiods using Diffusion Tensor and Fixel-Based techniques.

      Reviewer #3:

      In their paper, Orije et al. used MRI to study sexual dimorphisms in the brains of European starlings during multiple photoperiods, and how this seasonal neuroplasticity depends on brain size, song rates and hormonal levels. The authors' main findings include differences in hemispheric asymmetries between the sexes, multisensory neuroplasticity within and beyond the song control system in both sexes, and some dependence of singing behavior on brain size in females with large brains. The authors use different methods to quantify the changes in the MRI data in support of various possible mechanisms that could underlie the differences they see. They also record the birds' song rates and hormonal levels to correlate the neural findings with biologically relevant variables.

      The analysis is very impressive, taking into account the massive data set that was recorded and processed. The whole-brain, data-driven analysis prevented the authors from being biased towards well-known sexually dimorphic brain areas. Sampling a large number of subjects across many time points allowed for averaging in cases where individual measurements could not show statistical significance. The conclusions of the paper are mostly well supported by the data (except for some confounds that the authors mention in the text). However, the extensive statistically significant results described in the paper make it hard to follow at times.

      1) In the introduction, the authors mention the preoptic area as a mediator of increased singing and therefore of seasonal neuroplasticity. Did the authors find any differences in that area or in other well-known nuclei that are involved in courtship (PAG, for example)?

      Interestingly, we did not detect any seasonal changes in the preoptic area or PAG, whereas prior studies reported volume changes in the medial preoptic nucleus (POM) within 1-2 days after testosterone administration in canaries (Shevchouk, Ball et al. 2019). In male European starlings, POM volume changed seasonally, although this seems to depend on whether or not the males possessed a nest box (Riters, Eens et al. 2000). In our setup, the starlings were not provided with nest boxes. The lack of seasonal change in the POM could therefore have a biological explanation, in addition to the limitations of our methodology: since these are small, grey matter-like structures, they are less likely to be picked up with our diffusion MRI methods.

      2) Following the first comment, what is the minimum volume of an area of interest that could be detected using the voxel analysis?

      The up-sampled voxel size is 0.175 × 0.175 × 0.175 mm³. In the voxel-based statistical analysis, the significance threshold is set at a minimum cluster size of 10 voxels, i.e. approximately 0.05 mm³.
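      The stated minimum follows directly from the up-sampled voxel size and the 10-voxel cluster threshold:

```latex
V_{\text{voxel}} = (0.175\ \text{mm})^{3} \approx 5.36 \times 10^{-3}\ \text{mm}^{3},
\qquad
V_{\text{min}} = 10\, V_{\text{voxel}} \approx 0.054\ \text{mm}^{3} \approx 0.05\ \text{mm}^{3}
```

      This is the smallest extent a reportable cluster can have. It does not guarantee that a structure of that size will be detected, because, as noted above for small grey matter-like nuclei, diffusion contrast within them may be too weak.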

      3) It would be useful to have a figure describing the song system in European starlings and how the auditory areas, the cerebellum and the hippocampus are connected to it, before describing the results. It would make it easier for the broader community to make a better sense of the results.

      An additional figure was added to the introduction to give a schematic overview of the song control system, the auditory system and the proposed cerebellar and hippocampal projections. This scheme includes both a 2D and a 3D representation, as well as a movie of the 3D representation of the different nuclei and the tractography.

      Figure 1: Simplified overview of the experimental setup (A), schematic overview of the song control and auditory systems of the songbird brain and of the cerebellar and hippocampal connections to the rest of the brain (B), and unilateral DWI-based 3D representation of the different nuclei and the interconnecting tracts as deduced from the tractogram (C). Male and female starlings were measured repeatedly as they went through different photoperiods. At each time point, their songs were recorded, blood samples were collected, and T2-weighted 3D anatomical and diffusion-weighted images (DWI) were acquired. The 3D anatomical images were used to extract whole-brain volume (A). The song control system is subdivided into the anterior forebrain pathway (blue arrows) and the song motor pathway (red arrows). The auditory pathway is indicated by green arrows. The orange arrows indicate the connection of the lateral cerebellar nucleus (CbL) to the dorsal thalamic region, which further connects to the song control system as suggested by (Person, Gale et al. 2008, Pidoux, Le Blanc et al. 2018) (B,C). Nuclei in (C) are indicated in grey; the tractogram is color-coded according to the standard red-green-blue code (red = left-right orientation (L-R), blue = dorso-ventral (D-V) and green = rostro-caudal (R-C)). For abbreviations, see the abbreviation list.

      Figure 1 – figure supplement 1: Movie of the unilateral 3D representation of the different nuclei and the interconnecting tracts rotating along the vertical axis.

      4) In the results section the authors clearly describe which brain areas are sexually dimorphic or change during the photoperiod and what is the underlying reason for the difference. However, only in the discussion section it is clearer why some of those differences are expected or surprising. It would be useful to incorporate some of those explanations in the results section other than just having a long list of brain areas and metrics. For example, I found the involvement of visual and auditory areas in the female brain in the mating season very interesting.

      In addition to the reductions in technical explanation suggested by reviewer #2, we replaced some of the descriptions of significant regions with parts of the discussion and vice versa (indicated in purple). In this way, we incorporated some of the explanations of why certain findings are unexpected or relevant. Furthermore, we added some extra information on why these changes are relevant for the visual system and the cerebellum.

      In line 420: Neuroplasticity of the visual system could be relevant to prepare the birds for the breeding season, where visual cues like ultraviolet plumage colors are important for mate selection (Bennett, Cuthill et al. 1997).

      In line 424: This shows that multisensory neuroplasticity is not limited to the cerebrum, but also involves the cerebellum, something that has not yet been observed in songbirds.

    1. Author Response

      Reviewer #1 (Public Review):

      […] Overall, the results from these analyses are convincing and valuable, but still do not seem to be a big leap from their Unger 2021 paper […]. The methodology that they established should be described more clearly so that it can be shared with the research community. For example, how many donors were recruited for this experiment? Are there differences between individuals in the efficiency of B cell differentiation?

      Also, it would be important to assay for antibodies in the culture media. How would you suggest improving the culture system so that it can be used to model diseases?

      We appreciate the reviewer's queries and the points raised. Regarding the first set of comments, the reviewer has correctly observed that the assay methodology employed in this paper is not new or superior to our previously published work (Unger et al., Cells 2021), where we described a minimalistic in vitro system for efficient differentiation of human naive B cells into antibody-secreting cells (ASCs). However, the current study aims to provide a comprehensive evaluation of the phenotype of the cells in the in vitro system and of their relationships along potential differentiation pathways. In addition, we aimed to determine how the detailed gene expression profiles of the cells differentiating in vitro compare with their in vivo counterparts. In this way, we were able to uncover an antibody-secreting cell phenotype in vivo that had not been observed before and that could only be uncovered thanks to our full transcriptome knowledge of these cells. Furthermore, we present novel findings demonstrating that this culture system not only enables efficient ASC generation but also recapitulates the entire in vivo B cell differentiation pathway, as evidenced by the presence of germinal-center (GC)-like and pre-memory B cells in the culture. These results have not previously been reported for human B cells in culture and represent a significant contribution to the field of human B cell biology.

      Regarding the reviewer's inquiry about the cell culture protocol, its reproducibility, donor variability and additional experimental applications, we refer to three recent publications from our group that have adopted the same in vitro B cell differentiation system and have provided extensive analyses of immunoglobulin production and intracellular signaling pathways, as well as comparisons with other culture systems in the field (Marsman et al., Cells 2020; Marsman et al., Eur. J. Immunol. 2022; Marsman et al., Front. Immunol. 2022). In addition, we now realize that the section describing the culture system (MATERIAL AND METHODS - “In vitro naive B cell differentiation cultures”) was too concise, and we thank the reviewer for pointing this out. We have now extended it and corrected an inconsistency at lines 125-127: “After six days, activated B cells were collected and co-cultured with 1 × 10⁴ 9:1 wild type (WT) to CD40L-expressing 3T3 cells that were irradiated and seeded one day in advance (as described above), together with IL-4 (100 ng/ml) and IL-21 (50 ng/ml; Invitrogen) for five days.”

      As for the application of our in vitro system to disease modeling, as requested by the reviewer, this would require modifying the culture conditions to mimic the disease-specific biological background (if known), for instance by inhibiting or enhancing specific transcriptional pathways known to be associated with the disease in question. However, it would also require the presence of antigen-specific B cells in the pool of naive B cells included in the culture, which can be difficult to achieve because of their low frequency. Alternatively, the system could be used to study antigen-specific recall responses using antigen-specific memory B cells as starting material. Our group has evaluated this approach in a recent publication (Marsman et al., Front. Immunol. 2022).

      [...] B cell differentiation may also influence to cell cycle regulation. Rather than normalize its effect, can authors analyze effect of cell cycle in B cell differentiation? [...]

      We fully agree with the reviewer that the cell cycle plays a significant role in B cell differentiation output trajectories (Zhou et al., Front. Immunol. 2018; Duffy et al., Science 2012). While preparing the manuscript, we in fact performed a parallel analysis comparing clustering and marker gene selection with and without cell cycle regression. The dataset without cell cycle regression yielded different clusters from the cell cycle-regressed dataset (figure below). However, when we overlaid the clusters obtained from the cell cycle-regressed dataset, the additional clusters turned out to be the same cell populations, now split into cycling and non-cycling cells: cluster 2 is divided into the cycling cluster “c” and the non-cycling cluster “d”, while clusters 4 and 5 are divided into the cycling cluster “e” and the non-cycling cluster “f”. Examination of the expression of the top 50 genes associated with antibody-secreting cells in the cycling and non-cycling cells of clusters 4 and 5 shows that these genes are expressed at higher levels in cluster 5 than in cluster 4, regardless of cycling status. This suggests that cells in cluster 5 are further advanced in their differentiation, independent of their cell cycle state. This finding led us to present the cell cycle-regressed data in the manuscript. Should the reviewer wish, we are happy to add supplementary figures showing the non-regressed representation.

      Figure legend: A-C) UMAP projection of single-cell transcriptomes of in vitro differentiated human naive B cells without cell cycle regression. Each point represents one cell, and colors indicate graph-based cluster assignments identified without cell cycle regression (A), with cell cycle regression (B), or with cell cycle regression and additional subdivision into cycling and non-cycling cells (C). D) Dot plot showing the top 50 differentially expressed genes in cycling and non-cycling cells from clusters 4 and 5. Point size indicates the percentage of cells in the cluster expressing the gene; color indicates average expression.
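      For readers less familiar with the term, "cell cycle regression" generally means removing the part of each gene's expression that is explained by per-cell cell cycle phase scores before clustering. The sketch below illustrates that general idea with ordinary least squares in NumPy; it is similar in spirit to tools such as scanpy's `regress_out` or Seurat's `vars.to.regress`, but it is an illustration under assumptions (log-normalized input, precomputed S and G2M scores, hypothetical function name), not the pipeline used in this study.

```python
import numpy as np

def regress_out_cell_cycle(expr, s_score, g2m_score):
    """Return per-gene residuals after regressing out cell cycle scores.

    expr: (n_cells, n_genes) log-normalized expression (assumed input format)
    s_score, g2m_score: (n_cells,) precomputed phase scores (assumed)
    """
    n_cells = expr.shape[0]
    # Design matrix: intercept plus the two cell cycle covariates
    X = np.column_stack([np.ones(n_cells), s_score, g2m_score])
    # Fit all genes at once by least squares
    beta, *_ = np.linalg.lstsq(X, expr, rcond=None)
    # Residuals hold the variation not explained by the cell cycle scores
    return expr - X @ beta
```

      Because least-squares residuals are orthogonal to the regressors, a gene driven purely by the S score loses that signal after regression, while variation unrelated to the cell cycle is retained for clustering.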

    1. Author Response

      Reviewer #3 (Public Review):

      The manuscript by the Qiu and Lu labs investigates the mechanism of desensitization of the acid-activated Cl- channel, PAC. These trimeric channels reside in the plasma membrane of cells as well as in organelles and play important roles in human physiology. PAC channels, like many other ion channels, undergo a process known as desensitization, where the channel adopts a non-conductive conformation in the presence of a prolonged physiological stimulus. For PAC the molecular mechanisms regulating this process are not well understood. Here the authors use a combination of electrophysiological recordings and MD simulations to identify several acidic residues and a conserved histidine side chain as important players in PAC desensitization. The results are overall interesting and clearly indicate a role for these residues in this process. However, there are several weaknesses in the experimental design, inconsistencies between the mutagenesis data and the MD results, as well as in the interpretation of the data. For these reasons I do not think the authors have made a convincing mechanistic case.

      We thank the reviewer for the constructive comments and address the concerns point-by-point below.

      Major weaknesses:

      The underlying assumption in the interpretation of all the data is that the mutations stabilize or destabilize the desensitized conformation of the channel. However, none of the functional measurements provide direct evidence supporting this key assumption. Without direct evidence supporting the notion that the mutations specifically impact the rate of recovery from desensitization, I do not think the authors have made a convincing mechanistic case.

      We agree with the reviewer that our functional data measure the degree and rate of PAC channel entry into desensitization from the activated state upon prolonged acid treatment. This is a common experimental procedure in studies of ion channel desensitization/inactivation. Following the reviewer’s suggestion, we also sought to capture the kinetics of the transition from the desensitized state to the activated state by switching from a more acidic pH to a less acidic pH (for example, from 4.0 to 5.0) or to neutral pH. However, we found that such experiments are not feasible, partly because PAC desensitization is much slower than that of other channels, such as ASIC channels (see a recent study we cited: https://elifesciences.org/articles/51111). For the mutants with strong desensitization (E94R and D91R), it is unclear whether the currents recorded at pH 5.0 right after pH 4.0 represent the activated state or the desensitized state at pH 5.0. In other words, we do not know whether the PAC channel transitions from the desensitized state at the lower pH back to the activated state, or directly to the desensitized state at the higher pH. For the mutants with reduced desensitization, the current amplitudes at pH 4.0 were often similar to those at pH 5.0, which makes the recovery/transition difficult to resolve. We also tried switching from acidic pH to neutral pH. We found that PAC channels (both WT and mutants) return from the desensitized state to the closed state within seconds, a rate limited by our perfusion speed. These data suggest that the desensitized state of PAC is no longer maintained after the buffer is switched from low pH to neutral pH. In summary, in our opinion it is technically infeasible to measure the rate of recovery from desensitization to activation for the PAC channel. However, our data do support the conclusion that the rate of entry into desensitization from the activated state, a standard measure of desensitization, changes for the various channel mutants we studied.

      Overall, the agreement between the MD simulations, functional data, and interpretation are often weak and some issues should be acknowledged and addressed.

      For example:

      1) The experimental data suggests that H98, E107, and D109 play analogous roles in PAC desensitization. However, the MD simulations suggest that the H98-D109 interaction energy is ~4 times larger than that of H98-E107. This should lead to a much greater effect of the D109 mutation. How is this rationalized?

      The purpose of quantifying the interactions of H/R98 with E107 and D109 is to better dissect the mechanism by which H/R98 interacts with the acidic pocket residues. The result suggests that R98 associates less with E107/D109 than H98 does. It also suggests that D109 interacts more directly with H/R98 than E107 does. We acknowledge that this was not clear in our initial manuscript, and we have updated the text to better describe this result. However, this does not imply that the desensitization phenotype of E107R should be less pronounced than that of D109R. Both E107R and D109R are expected to disrupt the integrity of the acidic pocket, thus diminishing channel desensitization. It is worth pointing out that E107 plays a more complex role, as it was identified in our previous papers as one of the major proton sensors. The E107R mutation could make the PAC channel more sensitive to acid-induced activation (Figure 4d-e in Ruan et al., Nature, 2020), further complicating its effect on desensitization. Taken together, we do not expect the strength of the E107/D109 interactions with H/R98 to correlate quantitatively with the desensitization phenotypes of E107R and D109R.

      2) The experimental data shows that E94 plays a key role in desensitization and the authors argue that this is due to the interactions of this residue with the β10-11 linker. However, the MD simulations show that these interactions happen for a small fraction, ~10%, of the time and with interaction energies comparable to those of the H98-E107-D109 cluster. It is not clear how these sparse and transient interactions can play such a critical role in desensitization. Also, if the interaction energies are of the same sign, how come one set of mutants favors desensitization and one does not?

      The 10% value is the fraction of time during which at least one hydrogen bond forms between E94/R94 and the β10–β11 loop. It is NOT the fraction of time during which they interact, as there can be other types of non-bonded interactions, such as van der Waals and Coulombic interactions. In fact, our non-bonded energy calculation clearly suggests that R94 interacts with the β10–β11 loop much more favorably than E94 does (Figure 4C). The impact of E94R on the β10–β11 loop is also reflected in the root-mean-square fluctuation analysis, where the β10–β11 loop shows reduced flexibility when R94 is present (Figure 4B).
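      To make the distinction above concrete: hydrogen-bond occupancy is commonly computed as the fraction of trajectory frames in which at least one donor-acceptor pair satisfies a geometric criterion, so a frame without a hydrogen bond can still contribute van der Waals or Coulombic energy. The sketch below assumes a distance-only criterion with a 3.5 Å cutoff and a hypothetical (n_frames, n_pairs) distance array; real analyses usually add an angle criterion, and these are not the settings used in the study.

```python
import numpy as np

def hbond_occupancy(distances, cutoff=3.5):
    """Fraction of frames with at least one donor-acceptor pair within cutoff.

    distances: (n_frames, n_pairs) donor-acceptor distances in angstroms for
    candidate pairs between residue 94 and the loop (hypothetical layout).
    cutoff: distance criterion in angstroms (assumed value).
    """
    # A frame counts as hydrogen-bonded if any candidate pair is within the cutoff
    bonded = (distances <= cutoff).any(axis=1)
    # Occupancy = fraction of bonded frames
    return bonded.mean()
```

      For example, a 10-frame trajectory in which only one frame has a pair closer than the cutoff gives an occupancy of 0.1, even if the residues stay in electrostatic contact throughout.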

      Our central hypothesis is that PAC becomes more prone to desensitization when the desensitized conformation is stabilized. Two critical interactions characterize the desensitized structure of PAC: the association of E94 with the β10–β11 loop, and that of H98 with E107/D109. Therefore, we expect mutations that alter these interactions to affect PAC channel desensitization. In the MD simulations, the root-mean-square fluctuation of the β10–β11 loop is reduced for E94R compared with WT (Figure 4B), suggesting that the β10–β11 loop is stabilized when E94 is replaced by an arginine. The non-bonded interaction energy between residue 94 and the β10–β11 loop is also more negative for E94R than for WT (Figure 4C), another indicator of conformational stabilization. As a result, the E94R mutant favors desensitization. This is in sharp contrast with the H98R data, in which R98 interacts less favorably with E107/D109 than H98 does in WT (Figure 2F, G, H, I). Although the interaction energies are of the same sign, it is the difference between WT and each mutant that ultimately determines whether a given mutation favors desensitization.
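      For reference, the root-mean-square fluctuation invoked above is, for each atom, the square root of the time-averaged squared deviation from that atom's mean position; lower values indicate a less flexible region. The sketch below assumes coordinates already superposed onto a common reference (the alignment step is omitted) and a hypothetical (n_frames, n_atoms, 3) array; it illustrates the definition and is not the analysis script used in the study.

```python
import numpy as np

def rmsf(coords):
    """Per-atom RMSF from pre-aligned coordinates of shape (n_frames, n_atoms, 3)."""
    # Average position of each atom over the trajectory
    mean_pos = coords.mean(axis=0)
    # Squared displacement of each atom from its mean position, per frame
    sq_dev = ((coords - mean_pos) ** 2).sum(axis=2)
    # Average over frames, then take the square root
    return np.sqrt(sq_dev.mean(axis=0))
```

      An atom that never moves has an RMSF of 0, while an atom alternating between +1 Å and -1 Å along one axis has an RMSF of 1 Å.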

      The authors' MD analysis critically depends on assumptions on the protonation states of multiple residues, that are often located in close proximity to each other. In the methods, the authors state they use PropKa to estimate the pKa of residues and assigned the protonation states based on this. I have several questions about this procedure:

      • What pH was considered in the simulations? I imagine pH 4.0 to match that of the electrophysiological experiments.

      The exact pH environment cannot be explicitly modeled in standard MD, because the protonation state of an ionizable group is not allowed to change during the simulation. Therefore, we prepared the MD system by first predicting the pKa values of titratable residues of PAC in the desensitized state, and then assigning the protonation states of these residues based on those pKa values. We acknowledge that this part was not described clearly in our original manuscript. We have revised the Methods to better describe how the protonation states were assigned.

      • Was the propKa analysis run considering how choices in the protonation state of neighboring residues affect the pKa of the other residues? This is critical because the interaction energies will greatly depend on the protonation state chosen.

      The pKa analysis was done based on the WT structure and the residue protonation status was assigned based on the predicted value. It is possible that mutations on certain residues could change the pKa of neighboring residues. To evaluate this impact, we carried out pKa prediction for all the mutant structures that we used as input for simulation. This is summarized in the table below:

      As shown in the table, although mutations will affect the pKa of neighboring residues, the impact is generally within 0.3 units. As our simulation is carried out based on a pH of 4.0, this variability will not affect how we assign the residue protonation status.
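      The argument that pKa shifts of about 0.3 units leave the assigned protonation states unchanged can be made concrete with the Henderson–Hasselbalch relation, in which the protonated fraction is 1 / (1 + 10^(pH - pKa)). The sketch below assigns the majority state at pH 4.0 and checks whether a ±0.3 shift flips it. The function names and example pKa values are hypothetical placeholders, not values computed for PAC; the check simply shows that the assignment is robust whenever a predicted pKa lies more than 0.3 units from the simulation pH.

```python
def fraction_protonated(pka, ph):
    """Henderson-Hasselbalch: fraction of a titratable group in its protonated form."""
    return 1.0 / (1.0 + 10 ** (ph - pka))

def assigned_state(pka, ph=4.0):
    """Majority protonation state, as used for a fixed-charge MD setup."""
    return "protonated" if fraction_protonated(pka, ph) > 0.5 else "deprotonated"

def state_is_robust(pka, shift=0.3, ph=4.0):
    """True if shifting the pKa by +/- shift does not change the assigned state."""
    states = {assigned_state(pka + d, ph) for d in (-shift, 0.0, shift)}
    return len(states) == 1
```

      For instance, a histidine-like pKa of 6.5 stays protonated at pH 4.0 under a ±0.3 shift, whereas a hypothetical pKa of 4.2 would flip from protonated to deprotonated if it shifted down to 3.9.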

      • Was the pKa for the mutant constructs re-evaluated? For example, does having a Gln or Arg in place of a His affect the pKa of nearby acidic residues?

      We didn’t re-evaluate the pKa for each mutant in our initial manuscript. We have now conducted such an analysis, as indicated in the above table. The result suggests that arginine substitutions of H98/E94/D91 could affect the pKa values of nearby residues. However, the difference is relatively small and does not alter the predominant protonation states of these residues at pH 4.0.

      • H98R and Q have the same functional effect. The MD partially rationalizes the effect of H98R, however, it is not clear how Q would have the same effect as R on the interaction energies.

      Our analyses of H98R and H98Q serve two different purposes. H98 is expected to be protonated at pH 4.0. The fact that the H98Q mutation reduced PAC desensitization suggests that a positive charge at this location is critical for PAC desensitization, which we attribute to the loss of the favorable interaction between H98 and E107/D109. The H98R mutant is different, as arginine bears the same charge as a protonated histidine. Our data therefore suggest that the exact biochemical properties of H98, including its charge and side-chain flexibility, are crucial for PAC desensitization.

      • Are 600 ns sufficient to evaluate sampling of the different conformations?

      Our MD analysis does not intend to sample large conformational transitions between different functional states. Instead, it focuses on local dynamics, which allowed us to correlate the observations with the electrophysiology data. During the revision, we extended our simulations to 1 μs for each mutant. It is worth pointing out that because the PAC protein is a trimer and we performed all calculations across the three subunits, the effective sampling time is 3 μs per mutant. The new results are the same as those of our initial analysis, suggesting that the sampling time is sufficient to evaluate the metrics reported in the study. We have also acknowledged this limitation of our study in the Discussion.

    1. Author Response:

      Reviewer #2 (Public Review):

      The molecular mechanisms as well as the cellular players of colonization of the adult thymus are incompletely understood. In this manuscript, the authors investigate the role of the SIRPa-CD47 ligand pair in seeding of bone-marrow derived progenitors to the adult murine thymus. The study is based on the authors' earlier characterization of thymic portal endothelial cells, which have a role in mediating progenitor homing to the thymus (Shi et al., 2016). The authors show that loss of SIRPa or CD47 results in reduced frequencies and numbers of early T lineage progenitors (ETPs), but no substantial alterations in thymocyte numbers at later developmental stages and of bone-marrow precursors. Short-term homing assays suggest impaired colonization of the thymus. The authors further characterize cell biology and biochemistry of the SIRPa-CD47 system using peripheral lymphocyte co-cultures with genetically engineered MS1 endothelial cells. Finally, they assess the role of SIRPa-CD47 in thymus regeneration in combination with growth of a model tumor.

      Strengths:

      The authors describe a clear phenotype, consistent with the moderate effect size in ETP loss upon deletion of other homing mediators, such as PSGL-1 or individual chemokine receptors, such as CCR7, CCR9 or CXCR4.

      The authors use multiple genetic models, including both SIRPa- and CD47-deficient mouse strains, to support their findings. Using the Tie2Cre model for endothelial cell-specific deletion is particularly informative and could have been used more extensively. Some data are further strengthened by the complementary use of inhibitory SIRPa-Ig fusion proteins.

      In vitro analysis of the molecular mechanism and the role of signaling mediators using MS1 cells is well executed and conclusive.

      Weaknesses:

      Short-term homing assays suffer from the problem that the system is overwhelmed by an excessive number of donor cells (millions), whereas at steady state only a few hundred HPCs capable of colonizing the thymus circulate in peripheral blood, questioning the physiological relevance of this approach. The short-term nature of the experiments also precludes analysis, whether homed cells do in fact constitute T cell progenitors. More suitable experiments comprise mixed competitive bone marrow chimeras using congenically discernible donor cells or, even better, transfers into non-irradiated recipients of defined age as pioneered by the Goldschneider and Petrie labs. Thus, the conclusion that the SIRPa-CD47 system mediates homing of thymus seeding progenitors is not fully justified.

      a) Thank you for the comments. To overcome the disadvantage of total bone marrow transfer, we sorted progenitor-containing lineage^- bone marrow cells, which account for about 3% of total bone marrow cells, by MACS enrichment followed by FACS. The number of donor cells needed for transfer was thereby reduced from 5×10^7 total bone marrow cells per mouse to fewer than 1×10^6 lineage^- cells per mouse, which avoids the overwhelming effect of the previous method. A short-term homing assay with 1×10^6 lineage^- bone marrow cells confirmed the homing defect in the thymus of Sirpα^-/- mice (new Figure 2I), but not in the spleen (new Figure 2—figure supplement 2J).

      b) To track whether immigrating lineage^- progenitors actually develop into thymocytes, we adoptively transferred congenically marked (CD45.1) WT lineage^- cells into naïve, non-irradiated WT or Sirpα^-/- (CD45.2) recipients. Three weeks later, donor-derived cell subsets were analyzed. A significant defect in donor-derived thymocyte development, particularly at the DN and DP stages, was found in Sirpα^-/- mice (new Figure 2J,K). Therefore, the defective thymic homing of progenitor cells in Sirpα^-/- mice indeed affects subsequent T cell development.

      c) Mixed bone marrow chimeras, or transfer of mixed congenically discernible WT and CD47KO progenitor cells into non-irradiated WT recipients, are not applicable, as explained in detail in our response to the 2nd point of the Summary of Essential Revisions. This is probably due to rapid clearance of CD47-null cells by phagocytosis (Jaiswal et al., 2009). Therefore, it currently remains technically difficult to address the role of CD47 on progenitor cells in thymic homing using mixed competitive bone marrow chimeras or mixed progenitor cell transfer into non-irradiated hosts. Instead, we used a cleaner in vitro transwell assay to confirm the role of CD47 on progenitor cells during TEM (new Figure 4F), as explained in more detail just below.

      While technically elegant and mechanistically conclusive, the in vitro studies using MS1 cells and peripheral lymphocytes are somewhat isolated from the original focus of the paper addressing the role of SIRPa-CD47 specifically in thymus seeding. It should be considered devising similar assays replacing lymphocytes with bone-marrow derived progenitors.

      Major in vitro transendothelial migration assays have been repeated with FACS sorted lineage^- bone marrow progenitor cells (Lin^- BMCs). Lin^- BMCs showed significant defect of TEM on Sirpα^-/- ECs compared to that on WT ECs (new Figure 3F); Cd47^-/- Lin^- BMCs also showed significant defect of TEM compared with WT Lin^- BMCs (new Figure 4F). Therefore, the conclusion that progenitor CD47 - endothelial SIRPα signaling is required for TEM remains unchanged.

      Analysis of thymus regeneration is interesting, but a number of open questions remain for this experimental setup, also in part raised by the authors in the discussion section. Most notably, during regeneration, the reduction in ETPs is accompanied by reduced numbers in more mature thymocyte subsets and peripheral T cells. Such a reduction was not observed at steady-state in KO models and it cannot be concluded from this experiment, that these observations are caused by a defect in thymus colonization. Notably, SL-TBI is associated with massive cell death and alterations in phagocytosis and many other factors may come into play here as well.

      We agree with these comments. CV-1 treatment during SL-TBI induced thymic injury and regeneration is a complicated scenario. To make it cleaner, we did SL-TBI directly on Sirpα^-/- mice and control mice. Congenically marked bone marrow cells were also adoptively transferred for better monitoring. At 4 weeks after transfer, donor derived DN thymocyte subset was found defective in Sirpα^-/- recipients compared to that in control hosts (Figure R1). However, DP, SP subsets did not show difference, probably due to compensation effect.

      Figure R1. Reconstitution of bone marrow-derived progenitors in Sirpα^-/- mice. (A) Schematic view of the experiment. (B,C) Statistics of the proportion (B) and cell number (C) of donor-derived cells in the thymus 4 weeks after SL-TBI and adoptive transfer. n=6 in each group; unpaired t-test. *: p < 0.01

      As the reviewer indicated, SL-TBI is associated with massive changes in many aspects. Therefore, we also tested the role of SIRPα in thymic homing and thymocyte development at steady state. First, we performed a short-term homing assay using sorted lineage^- bone marrow progenitor cells instead of total bone marrow cells, to avoid the overwhelming effect of the massive number of cells used. The short-term homing assay with 1×10^6 lineage^- bone marrow progenitor cells showed a similarly significant defect in the Sirpα^-/- recipient thymus (new Figure 2I), but not in the spleen (new Figure 2—figure supplement 2J). Second, we examined subsequent T cell development in this setting. At 3 weeks after adoptive transfer of lineage^- bone marrow progenitor cells, a significantly reduced population of donor-derived thymocytes (mainly the DP subset) was found in Sirpα^-/- mice (new Figure 2J,K). However, it should be noted that later stages of thymocyte development, such as SP, were not significantly impaired, although there was a trend toward reduction in Sirpα^-/- mice.

      Thus, our data suggest that SIRPα deficiency impairs thymic homing of progenitor cells and is accompanied by a reduced ETP population and impaired early thymocyte development, whereas later thymocyte development is less affected, probably owing to compensatory effects. Whether this effect might be amplified in certain scenarios remains an intriguing open question.

      Taken together, the study in its present form contains the description of an interesting new phenotype, consistent with a role of the CD47-SIRPa interaction in colonization of the thymus by bone-marrow derived progenitors. However, at present, homing experiments lack sufficient rigor, and experiments on thymus regeneration, while showing an interesting additional finding, do not justify concluding that homing is the mechanistic explanation.

      Thank you for the comment. With these new data, we hope that the role of SIRPα in thymic progenitor homing, in T cell development at steady state, and in T cell regeneration in the SL-TBI setting has been made clearer. We agree that the causal relationship between thymic progenitor homing and thymus regeneration is still indirect and inconclusive, and may require further investigation in the future. In this study, we would like to place more emphasis on the novel role of CD47-SIRPα in controlling thymic progenitor homing and on the underlying molecular and biochemical mechanisms, which we hope are now convincingly supported.

      Reviewer #3 (Public Review):

      The manuscript by Ren et al. seeks to describe a role for endothelial cell (EC) expression of Sirpα playing a role in the importation of hematopoietic progenitors from the circulation into the thymus. Specifically, the authors demonstrate that there is a reduction in the number of the earliest T lineage progenitors (ETPs) in the thymus in mice deficient for Sirpa or CD47 (its ligand), and through a series of elegant in vitro transendothelial migration studies, identify that intracellular Sirpα signaling mediates this process by regulating VE-Cadherin expression and thus EC tight junctions. In particular, the use of transwell assays modified to study TEM is particularly well utilized to tease apart the mechanisms. Overall, I found this to be an excellent manuscript. In fact, every time I had a critique developing in my head, the authors quickly dispensed with it by producing some follow-up data that addressed my concern! My biggest concern with the manuscript is that it was difficult to determine exactly how many repeats of each experiment have been performed and what data is being presented in the figures (and being statistically analyzed). This should not change the conclusions of the manuscript but will make reading the figures and matching them with the legends easier. The following are some major and minor concerns that should be addressed to strengthen the manuscript:

      Major:

      • My main concern is that there needs to be greater care taken with highlighting the number of repeats done for each individual study as it is not always clear. For instance, in Figure 2 the data are presented as being representative of three independent experiments with an n of 3 in each experiment but in 2B, D, and F there are 4 data points for the Sirpa-/- group. This is likely explained by there being 4 mice in that particular experiment, but that is why the numbers should be presented for each experiment rather than a general statement at the end. Another example of this is that in Figure 2 S1 the authors would like to claim that the only differences are in the DN1 subsets which contains the ETPs. However, it is likely this is just due to low numbers as it seems like there is a real decrease in the number of DN2, DN3, DN4 and even DP thymocytes (as well as total cellularity).

      1. This should not change any conclusions of the paper but will aid in reader interpretation.

      Thank you for your advice. We apologize for this oversight and have rechecked all figure legends, now reporting the sample size for each panel individually. Furthermore, we repeated the experiments that had too few samples per group. For mouse experiments, we used littermates, which did not always yield equal numbers of mice per group; the number of mice used is now stated specifically for each experiment. For thymic subset analysis in Sirpα^-/- mice, we increased the sample size (n=5 for both the Sirpα^-/- and control groups, as shown in Figure 2—figure supplement 1A-E) and indeed found a significant decrease in the DN2, DN3 and DN4 subsets in Sirpα KO mice, although total cellularity was still not significantly changed. Overall, the conclusion of defective early thymocyte development in Sirpα^-/- mice remains valid.

      2. In this manuscript the authors show that Sirpa expression by TPECs is critical for their capacity to guide the importation of HPCs, and in their previous work they have shown that lymphotoxin can regulate the importation capacity of these same TPECs. Therefore, it would be extremely interesting to know if LT signaling is regulating the expression of Sirpa. Furthermore, it would be important to at least comment on what may be influencing Sirpa expression. For instance, we know from the work of Petrie and others that DN niche availability can influence the ability of the thymus to import progenitors. Similarly, after TBI the "gates" are let open and the capacity of the thymus to import progenitors increases. Do the authors know (or could they comment) on what happens to Sirpa expression after TBI in ECs?

      Thank you for your suggestion. How SIRPα expression is regulated on TPECs is an interesting and important question. As the reviewer suggested, we examined SIRPα expression in different settings. Given the important role of LT-LTβR signaling in TPEC development and maintenance, we first tested whether LT-LTβR signaling is required for SIRPα expression. However, the remaining TPECs in Ltbr^-/- mice showed a similar level of SIRPα expression to that in WT mice (new Figure 1—figure supplement 1C). The thymic stromal niche is another factor regulating thymic settling of progenitor cells (Krueger, 2018; Prockop and Petrie, 2004), and increased thymic stromal niche availability has been found after irradiation (Zlotoff et al., 2011). We therefore also measured SIRPα expression on TPECs at day 14 after 5.5 Gy sublethal total body irradiation and found no significant change (new Figure 1—figure supplement 1D). Whether SIRPα expression on TPECs is constitutive or regulated by changes in the thymic microenvironment remains to be tested in the future.

      3. The use of the in vitro TEM assays in transwell plates are a nifty way of interrogating and manipulating the effect of Sirpa in these conditions, however, the caveat is that these all use EC cell lines that do not correspond to the TPECs being described in vivo. This caveat should be acknowledged in the text.

      Thank you for the advice. The EC line we used, MS1, is a pancreatic islet endothelial cell line, which is neither derived from nor corresponds to TPECs. We have mentioned this caveat in the text.

      4. I am a little confused as to the interpretation of the final experiment looking at tumor clearance. The authors show that this could be clinically relevant as blockade of the CD47-Sirpa axis is becoming an increasingly attractive immunotherapy option but its use could preclude thymic recovery after damage and thus contribute toward poorer T cell responses against tumors. This last study is very interesting but also very hard to interpret given the likely positive effect of Sirpa-CD47 blockade on tumor clearance, in opposition to its potential effects hindering thymic repair. While it is notable that there is reduced clearance of tumor in mice treated with CV1, it is unclear why there does not seem to be any positive effect of CV1 on tumor clearance (is this because there are fewer T cells in the periphery as it is still early after damage?). On the thymic repair and reconstitution front, perhaps a cleaner way would be to look in Sirpa or CD47 deficient mice and without tumors.

      We agree that the findings regarding tumor immunotherapy need further mechanistic explanation; therefore, this part of the results has been removed from this study. In our approach, CV1 treatment precedes tumor inoculation, so CV1-mediated blockade of CD47 on tumor cells (which underlies CV1-mediated tumor clearance) would not occur. However, we did not test the underlying mechanism, which is quite interesting and will be addressed in a future study.

      As to the suggestion of testing thymic regeneration directly in Sirpα- or CD47-deficient mice, we have done this in Sirpα-deficient mice. We performed SL-TBI directly on Sirpα-/- and control mice. Congenically marked bone marrow cells were also adoptively transferred for better monitoring. At 4 weeks after transfer, the donor-derived DN thymocyte subset was defective in Sirpα-/- recipients compared with control hosts, whereas the DP and SP subsets did not differ, probably owing to compensatory effects (Figure R1).

      Minor Comments:

      • In Fig. 2I (and Fig. 2S2I-J), it is difficult to determine how long after the chimera transplant the homing assays were performed. However, this approach has limitations as the process of creating those chimeras (conditioning such as irradiation etc.) will change the function and possibly the mechanisms of progenitor entry into the thymus. There is clearly still an effect of Sirpa in this context but it is possible (even likely) that the importation mechanisms in the thymus change after damage such as that caused by the conditioning required in the initial chimera generation.

      To clarify the timing of the short-term homing assays in bone marrow chimeric mice, we have updated the legend of the related figure (now Figure 2G in the article): the homing assays were performed 8 weeks after chimeric reconstitution. We agree that changes in thymic homing mechanisms could contribute to the abnormal entry of progenitor cells. To exclude this potential effect, we performed homing assays without irradiation. In this experiment, we also observed impaired short-term homing (new Figure 2I) and impaired subsequent T cell development (new Figure 2J,K).

      Furthermore, although using the Tie2-Cre strain will distinguish Sirpa on ECs and TECs, it will not distinguish between expression on other cells such as DCs (Tie2 will delete expression in both endothelial and hematopoietic lineages). Although the optimal experiment to address these concerns would be to delete Sirpa from ECs specifically (such as with Cdh5-CreERT2 mice), I am convinced by the preponderance of in vitro data that there is an EC-specific effect and therefore it is not necessary to perform this time-consuming, albeit interesting, potential experiment. However, these limitations should be acknowledged in the discussion or text.

      Thank you for this suggestion; we now discuss this limitation in the text.

      • As a technical note I am surprised that there was considerable reconstitution of naive T cells at day 21 after TBI (Fig.7G-H). In our experience that is very early for naïve T cells in the periphery which generally take about 4 weeks to start reconstituting in a real sense. Is it possible there are direct effects of this treatment on residual radio-resistant peripheral T cell numbers?

      Thank you for sharing this observation. Indeed, we cannot exclude the possibility of residual radio-resistant peripheral T cells. To clarify this, we performed SL-TBI (6 Gy) followed by adoptive transfer of congenically marked WT (CD45.1) total bone marrow cells into Sirpα-/- or control mice (CD45.2) for better monitoring. Under these conditions, more than 97% of thymocytes were donor-derived in both groups at day 28, and the thymus had been completely reconstituted (Figure R2). In addition, as shown in Figure R1, the donor-derived DN thymocyte subset was significantly reduced in Sirpα-/- mice compared with control mice, whereas no defect was found at later stages of thymocyte development.

      Given the complexity of the original experimental design, and as suggested by the reviewers, the original Fig. 7 has been removed. We hope that the new data described above will help clarify the role of SIRPα in thymic regeneration.

      Figure R4. Chimerism detection at day 28 in hosts transferred with bone marrow cells. (A) Chimerism of thymic subsets, where chimerism = CD45.1+ % / (CD45.1+ % + CD45.2+ %). (B) Representative FACS plots of donor (CD45.1) and host (CD45.2) cells among total thymocytes (gated on single, live cells). n = 6 per group; unpaired t-test. **: p < 0.01
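The chimerism formula in the legend amounts to the donor fraction of CD45+ events within each gated subset. A minimal sketch, assuming the inputs are the CD45.1+ and CD45.2+ percentages read from the same single/live-cell gate; the function name and the example values are hypothetical and are not data from this study:

```python
def chimerism(cd45_1_pct: float, cd45_2_pct: float) -> float:
    """Donor chimerism = CD45.1+ % / (CD45.1+ % + CD45.2+ %).

    Hypothetical helper: cd45_1_pct is the donor (CD45.1+) percentage and
    cd45_2_pct the host (CD45.2+) percentage within one gated thymic subset.
    Returns a fraction between 0 and 1.
    """
    total = cd45_1_pct + cd45_2_pct
    if total <= 0:
        raise ValueError("no CD45.1+ or CD45.2+ events in this gate")
    return cd45_1_pct / total


# Illustrative input only: 90% donor and 10% host events in a gate.
print(chimerism(90.0, 10.0))
```

Multiply the returned fraction by 100 to express chimerism as a percentage.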

    1. Author Response:

      Reviewer #1 (Public Review):

      This manuscript integrates conditional mouse models for TRAP, PAPERCLIP and FMRP-CLIP together with compartment specific profiling of mRNA in hippocampal CA1 neurons. Previously, similar approaches have been used to interrogate mRNA localization, differential regulation of 3'UTR isoforms, their local translation, and FMRP-dependent mRNA regulation. This study builds on these previous findings by combining all three approaches, together with analysis of mRNA dysregulation in Fmr1 KO neuron model of FXS. The strengths of the paper are the rich data sets and innovative integration of methods that will provide a valuable technical resource for the field. The weakness of the paper is the limited conceptual advance as well as lack of deeper mechanistic insights on FMRP biology over previous studies, although the present study validates and integrates past studies, adding some new information on 3'UTR isoforms.

      We appreciate the Reviewer’s recognition that “the present study validates and integrates past studies, adding some new information on 3'UTR isoforms”. We also appreciate the Reviewer’s recognition that “The strengths of the paper are the rich data sets and innovative integration of methods that will provide a valuable technical resource for the field.”

      We differ, however, with the concern that the work presents a “limited conceptual advance.” Specifically, we find, for the first time, that FMRP regulates two different biologically coherent sets of mRNAs in CA1 neuronal cell bodies and neurites. This provides a profound new insight into FMRP-RNA regulation, including the fact that these two different sets of mRNA targets (encoding chromatin-associated proteins and synaptic proteins, respectively) are both translationally regulated by FMRP and transcribed from genes implicated in autism.

      We recognize that FMRP was known, from our own work and that of others (as noted by the Reviewer), to regulate specific targets “in bulk” in neuronal cell types, in brain, and even in CA1 neurons. What is most unexpected here is that, among mRNAs directly bound by FMRP in brain CA1 neurons, this regulation is subcellularly compartmentalized. This is new for FMRP, and in fact is new for RNA-binding proteins more generally (recognizing, of course, the extensive previous work by others on RNA localization to different compartments, beginning with Rob Singer’s work on actin mRNA localization and continuing to the present in neurons).

      We also think it is important for readers to understand up front the novelty of the “combined approaches” referred to by the Reviewer. We use cell-specific (cTag) CLIP to define direct FMRP interactions in subcompartments (dendrites vs. cell bodies) of CA1 neurons within mouse hippocampus. We also normalize these data to ribosome-bound mRNAs in CA1 neurons, and validate our observations by studying WT and FMRP-null brains. This set of complex mouse models and methods is completely new, and its application is what allowed us to draw robust conclusions about FMRP translational regulation of different mRNAs in different cellular compartments.

      We strongly disagree with the Reviewer’s comment that direct interaction of FMRP with functional classes of mRNAs in different cellular compartments “has previously been shown in the field.” We are not aware of any report of compartment-specific FMRP-CLIP, much less in a cell-type specific manner. Our previous cell-type specific FMRP-CLIP experiments were performed on bulk neuronal material (Sawicka et al. 2019; Van Driesche et al., n.d.). Although cell-type specific TRAP-seq has been performed on microdissected CA1 compartments (Ainsley et al. 2014), the investigators were unable to isolate significant amounts of RNA from resting neurons, and degradation of the isolated RNAs precluded the types of 3’UTR and alternative splicing analyses performed here. The Schuman group has performed extensive analysis of mRNAs from microdissected CA1 compartments (Cajigas et al. 2012a; Tushev et al. 2018), but did not perform FMRP-CLIP or any experiments using cell-type specific or direct protein-RNA regulatory methods. In vitro systems have been used to analyze mRNA localization in FMRP KO systems (e.g., Goering et al. 2020), but they cannot fully recapitulate the complexity of brain regions in vivo, and those studies did not analyze direct RNA-protein interactions. Because our work is performed on brain tissue obtained in vivo, is cell-type specific, and integrates TRAP-seq, PAPERCLIP and CLIP-seq datasets, we believe that it is novel and will be of great interest to the field.

      Despite the fact that FMRP targets are overrepresented in the dendritic transcriptome, it does not appear from this study that FMRP plays an active role in the mechanism of dendritic mRNA localization, at least under steady state conditions. One goal of the manuscript is to address a major question in the mRNA localization field, which is how FMRP may differentially modulate "localization" of functional classes of mRNAs such as those encoding transcriptional regulators and synaptic plasticity genes (Line 78-90). The data here indicate that FMRP directly interacts with functional classes of mRNAs in different cellular compartments, which has previously been shown in the field. However, no evidence is provided that mechanistically reveal a role for FMRP to promote subcellular localization of different functional classes of mRNAs. The correlative evidence presented in this manner does not add mechanistic insight.

      We do recognize that the question of what differentially localizes FMRP mRNA targets in the dendrite (and cell body) is of great interest and remains unanswered. We also appreciate that, despite the comment above, the Reviewer recognizes that “it does not appear from this study that FMRP plays an active role in the mechanism of dendritic mRNA localization, at least under steady state conditions.”

      We believe that some of the confusion here lies in the Reviewer’s comment “One goal of the manuscript is to address a major question in the mRNA localization field, which is how FMRP may differentially modulate "localization" of functional classes of mRNAs such as those encoding transcriptional regulators and synaptic plasticity genes (Line 78-90).” While this question is of interest and has been studied, we think there is a major disconnect between the Reviewer’s comments and our findings. To be clear, in the original manuscript we did not find evidence, in WT vs. KO CA1 neurons, that FMRP acts to differentially localize mRNAs, including those mentioned by the Reviewer.

      Nonetheless, to further address a possible role for FMRP in localizing the transcripts it regulates, we have now performed quantitative analysis of FMRP target mRNA localization in dendrites from WT vs. Fmr1 KO mice. These results are now presented in Supplemental Figures 9 and 10 of the manuscript and are summarized below.

      Supplemental Figure 9. FMRP is not required for localization of its targets into the dendrites of CA1 neurons. A) Dendrite-enriched mRNAs were defined in FMRP KO mice (red) in the same manner as in Figure 1 for FMRP WT animals, using bulk RNA-seq and TRAP-seq data. Overlap with dendrite-enriched mRNAs in WT (Figure 1, shown here in green) and CA1 FMRP targets (blue) is shown. 95.6% of dendrite-enriched FMRP targets in WT were also enriched in the dendrites of FMRP KO animals. B) Dendrite-present mRNAs were defined in FMRP KO. Overlap with dendrite-present mRNAs in WT (Figure 1) and CA1 FMRP targets is shown. 95.7% of dendrite-present FMRP targets in WT were also dendrite-present in KO animals. C-E) FISH was performed to assess localization of the FMRP targets Kmt2d (C), Lrrc7 (D) and Map2 (E) in FMRP KO mouse brain slices. Left panels show the proportion of detected mRNAs located in the neuropil (>10 µm from the predicted cell body layer) in WT and KO animals; significance was assessed by Wilcoxon rank-sum tests. Middle panels show the density of 1000 spots sampled from each image analyzed. Distance from the cell bodies was determined as described in Methods and Figure 1. In the right panels, spots were binned into 15 groups according to distance from the cell bodies, and the fraction of spots in each bin was compared between FMRP WT and KO animals by t-test (* indicates p < .05, ** indicates p < .01).
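The neuropil-fraction and 15-bin distance analyses described in this legend can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: per-spot distances (in µm) from the cell body layer are assumed to be already computed, and the 15 bins are assumed to be equal-width over a chosen maximum distance (the legend does not give bin edges). Both function names are hypothetical.

```python
from typing import List, Sequence

NEUROPIL_CUTOFF_UM = 10.0  # spots >10 µm from the cell body layer count as neuropil
N_BINS = 15                # number of distance bins named in the legend


def neuropil_fraction(distances_um: Sequence[float]) -> float:
    """Fraction of detected spots lying >10 µm from the cell body layer."""
    if not distances_um:
        raise ValueError("no spots")
    return sum(d > NEUROPIL_CUTOFF_UM for d in distances_um) / len(distances_um)


def binned_fractions(distances_um: Sequence[float], max_um: float,
                     n_bins: int = N_BINS) -> List[float]:
    """Fraction of spots in each of n_bins equal-width bins on [0, max_um].

    Assumption: equal-width bins; distances beyond max_um are clamped into
    the last bin and negative distances into the first bin.
    """
    if not distances_um:
        raise ValueError("no spots")
    width = max_um / n_bins
    counts = [0] * n_bins
    for d in distances_um:
        idx = min(int(d // width), n_bins - 1)
        counts[max(idx, 0)] += 1
    return [c / len(distances_um) for c in counts]
```

Per-image bin fractions from WT and KO slices could then be compared bin by bin with a two-sample t-test (for example `scipy.stats.ttest_ind`), and per-image neuropil fractions with a rank-sum test (`scipy.stats.ranksums`).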

      Supplemental Figure 10. FMRP is not required for differential localization of 3’UTR isoforms of its targets. A) Differential 3’UTR usage was analyzed using DEXseq, as described in Figure 2, to identify 3’UTRs whose ratio of usage between neuropil and cell bodies differed between FMRP WT and KO animals. Shown are results from the DEXseq analysis, plotting the log2FoldChange (neuropil vs. cell bodies, KO vs. WT) and -log10(p-value) of each 3’UTR. Gray spots indicate that all 3’UTRs analyzed have FDR > .05, i.e., no significant change in usage between FMRP KO and WT animals. B and C) FISH analysis of the localization of 3’UTR isoforms of Cnksr2 (B) and Anks1b (C) in FMRP WT and KO animals. These genes were found in Figure 2 to express 3’UTR isoforms that are differentially localized to dendrites. Sequestered isoforms are those significantly localized to cell bodies in FMRP WT, and Localized isoforms are those significantly used in the dendrites of WT CA1 neurons. Left panels, the fraction of spots localized to the neuropil (>10 µm from the cell body layer) for each isoform in FMRP WT and KO animals; differences were assessed by Wilcoxon rank-sum tests. Middle panels, density of the distance from the cell bodies for a representative 1000 spots from each image analyzed. Right panels, as described in Supplemental Figure 9, detected mRNAs were binned into 15 bins according to distance from the cell bodies, and differences in the fraction of spots in each bin between FMRP WT and KO slices were analyzed by t-test (* indicates p < .05).

      In summary, we characterized the dendritic transcriptome in FMRP KO animals, and compared it to the FMRP WT results presented in Figures 1 and 2, as suggested by the Reviewers. We find that the dendritic transcriptome of FMRP KO animals is extremely similar to that of FMRP WT animals, with ~95% of mRNAs found to be dendrite-present or dendrite-enriched in WT also being found in FMRP KO animals (Figure S9). We validated these results with FISH and found no evidence for significant disruption in the localization of FMRP targets Kmt2d (Figure S9C), Lrrc7 (Figure S9D) or Map2 (Figure S9E) to the CA1 neuropil.

      To detect FMRP-dependent changes in the distribution of 3’UTR isoforms of FMRP targets, we first performed a global analysis of 3’UTR usage in TRAP from FMRP KO animals, using the expressed 3’UTR isoforms identified in Figure 2. DEXseq analysis of 3’UTR expression in CA1 neuropil vs. cell body TRAP showed no significant instances of altered 3’UTR usage ratios in FMRP KO animals (Figure S10A). We validated these results by performing FISH on the sequestered and localized 3’UTR isoforms of the Cnksr2 and Anks1b genes, and found no significant changes in the localization of these 3’UTR isoforms in FMRP KO animals (Figure S10B-C). Taken together, these data suggest that FMRP is not significantly involved in the localization of its targets in resting CA1 neurons, but rather shows remarkable selectivity for localized mRNA isoforms. Instead, we find evidence that FMRP regulates the ribosome association of its targets in a compartment-specific manner, shown by an increase in ribosome association of a subset of FMRP targets in the dendrites of FMRP KO CA1 neurons (see Figure 7E).

      Besides the addition of the figures described above, we have also now made corrections to the text of the manuscript, enumerated below, to address this.

      First, we have, as much as possible, reduced our emphasis throughout the manuscript on the “localization” of mRNAs, and instead point out that the study seeks to characterize the differences between the regulated transcriptomes in CA1 cell bodies and dendrites. For example, in Figure 4, instead of characterizing the log2FoldChange (neuropil vs. CA1 cell bodies) as “dendritic localization”, we now use the wording “relative dendritic abundance” to focus on the abundance of these transcripts in dendrites vs. cell bodies. We also changed the heading of the Results section that describes analysis of FMRP KO animals from “Dysregulation of mRNA localization in FMRP KO animals” to “FMRP regulates the ribosome association of its targets in dendrites”. We believe that these changes will help to clear up this confusion for the reader.

      Second, we reformatted the model in Figure 7F. The new version of the model (shown here) emphasizes the point that our study reveals compartment-specific FMRP regulation of a subset of its targets without implying a role for FMRP in the mRNA localization of these transcripts. The text of the manuscript and figure legends have been updated accordingly.

      Figure 7F. Distinct, compartment-specific FMRP regulation of functionally distinct subsets of mRNAs in CA1 cell bodies and dendrites. In dendrites, the absence of FMRP increases the ribosome association of its targets; this finding is consistent with a model in which FMRP inhibits ribosomal elongation and thereby translation (J. C. Darnell et al. 2011). In resting neurons, the translation of FMRP-bound mRNAs encoding synaptic regulators (FM2 and FM3 mRNAs) is repressed. When FMRP is absent, due either to genetic alteration (FMRP KO or FXS) or to neuronal activity-dependent regulation (e.g., calcium-dependent dephosphorylation of FMRP; Lee et al. 2011; Bear, Huber, and Warren 2004), ribosome association and translation of targets are increased. In cell bodies, FMRP binds mRNAs that encode chromatin regulators (the FM1 cluster of FMRP targets), as well as FM2/3 mRNAs (consistent with synapses forming on the cell soma). FM1 targets show patterns of mRNA regulation similar to those our group observed in bulk CA1 neurons: FMRP target abundance is decreased in FMRP KO cells, perhaps due to loss of the FMRP-mediated block of degradation of mRNAs with stalled ribosomes (Sawicka et al. 2019; R. B. Darnell 2020).

      Third, we have revised the Discussion in order to more completely discuss the model above and also emphasize the finding that FMRP was not found to be involved in the localization of its mRNA targets, but rather in the regulation of the local translation of its targets in a compartment-specific manner. We further speculate on the roles of FMRP in regulation of mRNA abundance and translation in these compartments.

      We hope that these changes better reflect the interpretation and novelty of our findings for both the Reviewers and the readers.

      Further related to a role of FMRP in mRNA localization, a recent paper in eLife reports that FMRP RGG box promotes mRNA localization of a set of FMRP targets through G-quadruplexes (Goering et al 2020). This relevant paper needs to be cited and discussed.

      We apologize for this omission, and have now cited and discussed this paper in the Results and Discussion of the manuscript. Importantly, we find that dendrite-enriched mRNAs have high GC content (see figure below, which is now Supplemental Figure 5). This complicates the identification of potential G-quadruplexes: put another way, G-rich mRNAs will be enriched among dendrite-enriched mRNAs when compared with non-localized mRNAs, and the same is true for C-rich mRNAs. Dendrite-enriched, directly FMRP-bound CA1 neuronal targets (defined by CLIP) are in fact G-poor when compared with dendrite-enriched FMRP non-targets (see new Figure S5 and below).

      Supplemental Figure 5A-D: Dendrite-enriched mRNAs are GC-rich, and dendrite-enriched FMRP targets are GC-poor compared with dendrite-enriched non-FMRP targets. A) Schematic of the overlap between CA1 FMRP targets and dendrite-enriched mRNAs (defined in main Figure 1). B) GC content, defined as percent G + C, for all CA1 mRNAs, dendrite-enriched mRNAs (1211), dendrite-enriched FMRP targets (413), and dendrite-enriched non-FMRP targets (798; see A). Stars indicate significance in Wilcoxon rank-sum tests (* is p < .05, ** is p < .0001). C) G content, defined as percent G. D) C content, defined as percent C.
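The GC, G, and C content measures in panels B-D are simple base-composition percentages. A minimal sketch, assuming each mRNA is supplied as a nucleotide sequence string; the helper names are hypothetical, and the legend does not state whether content was computed over full transcripts or 3’UTRs only:

```python
def base_percent(seq: str, bases: str) -> float:
    """Percent of nucleotides in seq that are one of `bases` (case-insensitive)."""
    s = seq.upper()
    if not s:
        raise ValueError("empty sequence")
    return 100.0 * sum(ch in bases for ch in s) / len(s)


def gc_content(seq: str) -> float:
    """Percent G + C (panel B)."""
    return base_percent(seq, "GC")


def g_content(seq: str) -> float:
    """Percent G only (panel C)."""
    return base_percent(seq, "G")


def c_content(seq: str) -> float:
    """Percent C only (panel D)."""
    return base_percent(seq, "C")


print(gc_content("GGCCAAUU"))  # → 50.0
```

Per-set distributions of these percentages could then be compared with a Wilcoxon rank-sum test, for example `scipy.stats.ranksums`.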

      In light of these observations, G- or C-containing motifs need to be examined in this context. To this end, we performed the suggested analysis by comparing the prevalence of G-quadruplexes in dendrite-enriched FMRP targets versus dendrite-enriched FMRP non-targets (Figure S5A). To do this, we used both experimentally defined G-quadruplexes (described in Guo and Bartel 2016; Figure S5E) and G-quadruplex motifs (described in Goering et al. 2020; Figure S5F). We include the results below and in a new Figure S5 in the paper.

      Supplemental Figure 5E-F: mRNAs containing G-quadruplexes are not enriched in dendritic FMRP targets vs. dendrite-enriched non-FMRP targets. E) The percent of all CA1 mRNAs, all dendrite-enriched mRNAs, dendrite-enriched FMRP-bound targets (413), and dendrite-enriched non-FMRP targets (798) that contain experimentally defined G-quadruplexes is plotted. Shown are the results of chi-squared analysis comparing the enrichment of G-quadruplex-containing mRNAs in dendrite-enriched FMRP targets vs. dendrite-enriched non-FMRP targets. F) As in E, except looking for the presence of mRNAs with G-quadruplex motifs in 3’UTRs as described in Goering et al. 2020.

      Interestingly, we found no difference in the presence of G-quadruplex motifs in the 3’UTRs of these two sets (above and new Supplemental Figure 5). For example, of 413 dendrite-enriched FMRP targets, 100 (24%) had experimentally defined G-quadruplexes in the 3’UTRs, while 159 (22.5%) dendrite-enriched non-FMRP targets had experimentally defined G-quadruplexes. These differences were not significant (by chi-square test).
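The comparison reported here is a 2x2 contingency test: gene set (dendrite-enriched FMRP targets vs. non-targets) by presence or absence of a G-quadruplex. A minimal stdlib sketch, assuming a Pearson chi-square without continuity correction (the response does not say whether a correction was applied); the function name and the example counts are hypothetical:

```python
import math
from typing import Tuple


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Rows: gene sets (e.g. FMRP targets, non-targets).
    Columns: mRNAs with / without a G-quadruplex.
    Returns (statistic, p-value) with 1 degree of freedom.
    """
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    if 0 in rows or 0 in cols:
        raise ValueError("a row or column total is zero")
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    # With 1 df, chi2 = z**2, so p = P(|Z| > z) = erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(stat / 2.0))
    return stat, p


# Hypothetical counts, chosen only to show the calculation.
print(chi_square_2x2(20, 0, 0, 20)[0])  # → 40.0
```

`scipy.stats.chi2_contingency` computes the same test, but it applies Yates' continuity correction to 2x2 tables by default, so its statistic will differ from this sketch unless `correction=False` is passed.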

      Searching the 3’UTR sequences of the 413 dendrite-enriched FMRP targets for G-quadruplex motifs (as described in Goering et al. 2020, which searched for an empirically derived specific motif: GW--G, separated by 7nt), we found only 3 instances in dendrite-enriched FMRP-bound target mRNAs. Similarly, only a small subset (6) of the 798 non-FMRP targets contained this specific motif in their 3’UTRs. These differences were not significant (chi-square test).

      In summary, we find no evidence in our data that G-quadruplexes play a role in determining FMRP binding in CA1 dendrites. These data are now included in the Results and discussed in the Discussion of the paper.

      Reviewer #2 (Public Review):

      The authors performed transcriptomic analyses from compartment-specific, micro-dissected hippocampal CA1 region tissue from transgenic mice. One feature that distinguishes this work from previous studies is the use of conditional knock-in of tags (GFP or HA) and tissue specific expression of the Cre recombinase to target a very specific population of pyramidal neurons in the CA1 region--as well as the combined use of TRAPseq, PAPERCLIP and FMRP-CLIP. Also, central to this work are the analysis pipelines that look at large populations of mRNA with the goal of finding features shared by those mRNA that bind FMRP.

      First, they established the identity of mRNAs that are dendritically enriched and/or alternatively polyadenylated (APA) by sequencing, followed by validation of a few candidates using smFISH. Next, the APA data were filtered through the rMATS statistical program to identify alternatively spliced (AS) mRNA variants within the APA population. The authors concluded that the majority of splicing events were of the exon-skipping type, with NOVA2 as the likely culprit leading to this differential localization of AS isoforms. The authors then proceeded to perform FMRP-CLIP, which was analyzed against the TRAP dataset. The 413 mRNAs shared by the two experiments (TRAP and FMRP-CLIP) exhibited two notable features: dendrite enrichment and longer average transcript length. More importantly, they demonstrated that FMRP can preferentially bind to an AS isoform that is enriched in dendrites. Further analyses of FMRP CLIP targets showed that they shared a significant level of genes designated by gene set enrichment analysis (GSEA) as involved in ion transport and receptor signaling, and similarly for ASD-related candidate genes.

      Strengths:
      - The combined use of tissue-specific Cre and conditional tags for RPL22, PABPC1 and FMRP helps make these pull-downs highly specific and robust.
      - The RNA sequencing approach allows identification and comparison of populations of ribosome-, PABPC1- and FMRP-associated mRNAs.
      - Preferential binding of FMRP to AS or APA isoforms in dendrites is an impactful and significant finding.

      Weaknesses: -A caution in interpreting comparative or differential RNA-sequencing results as some are correlative.

      We appreciate this concern, and agree that RNA-seq analysis alone can be difficult to interpret. However, we feel that our unique approach of combining multiple cell-type specific approaches, including CLIP-seq and PAPERCLIP along with TRAP-seq and RNA-seq result in stronger conclusions that are supported by multiple lines of evidence.

      -Validation of FMRP interaction with AS or APA isoforms or ASD candidates by smFISH-IF is lacking.

      We find that smFISH-IF in the CA1 neuropil is difficult to interpret in mouse brain slices because dense networks of processes and contaminating cell types make IF signals dense, noisy and difficult to quantitate. Although we could in principle attempt these experiments in an in vitro cell culture model, we believe that the novelty of our work lies in a) the cell-type specific nature of our analyses and b) the fact that our analysis and validation are all performed in vivo. We are not confident that in vitro systems are similar enough to our in vivo system to be relevant for this work. This is due not only to differences in their transcriptomes, but also to the limited number of synapses that cultured neurons form with other neurons compared with CA1 neurons in the brain. Instead, we validate the interactions between FMRP and AS and APA isoforms by isolating junction reads among FMRP-CLIP tags obtained in a cell-type specific manner from intact mouse brains (Figure 5). In this manner, we find direct evidence of FMRP selectively binding to dendritic mRNA isoforms in vivo.

      -Although hippocampal CA1 region is an excellent site to study FMRP-RNA interactome, are there other projection systems where altered FMRP-RNA interaction may lead to greater dysfunction?

      We appreciate this point and now include this in the revised Discussion.

    1. Author Response

      Reviewer #1 (Public Review):

      Identifying private peptides for generating personalised cancer vaccines is a promising approach to launching a robust anti-tumor response; however, challenges remain in developing an effective process to achieve this. In this manuscript, the authors present an interesting and powerful pipeline (PeptiCRAd) to achieve this goal using the CT26 model. Overall, this manuscript is well written and presented. Although this work presents interesting findings and a useful pipeline, I have the following concerns. I feel that this manuscript would improve if these concerns were addressed.

      We thank the reviewer very much for having appreciated the quality and the originality of our work.

      1. It will be critical to confirm TILs and T cells in draining lymph nodes indeed recognise the peptide used in Figure 7-8 by ELISPOT of IFNg.

      We agree with the reviewer’s comment that it is important to confirm, by functional characterization in an IFN-γ ELISPOT assay, that TILs and T cells in draining lymph nodes recognise the peptide used in Figures 7-8. In our experience, the ELISPOT assay works best with fresh samples; additionally, splenocytes provide enough cells to test the reactivity of individual mice to single peptides. Because the samples from Figures 7 and 8 were frozen, we repeated the animal experiment using the Figure 7 treatment schedule and then performed the ELISPOT on freshly harvested splenocytes. Based on the previous results, we selected the best-performing group (PeptiCRAd1) for further investigation of the peptide response; untreated mice (Mock) and virus alone (VALO-mD901) were used as controls. Interestingly, peptide deconvolution showed T cell reactivity to one peptide (RYLPAPTAL, peptide 2) (Figure 1A) in the PeptiCRAd1 group, whereas no T cell reactivity was observed for SYLPPGTSL (peptide 1) (Figure 1B). These data highlight the role of an individual antigen in eliciting a specific anti-tumor T cell response, making it an interesting candidate for further proof-of-concept studies in animal experimental settings.

      Figure 1. Interferon-γ ELISPOT results. Splenocytes harvested from the treatment groups (as indicated in the figure) were functionally characterized in an IFN-γ ELISPOT assay; the individual response of each mouse to SYLPPGTSL (A) and RYLPAPTAL (B) is reported as IFN-γ spot-forming cells (SFC)/10^6 splenocytes. Data are shown as dot plots of individual mice with mean + SEM. (Virus = VALO-mD901, PC = PeptiCRAd).

      2. It would be interesting to see if this pipeline can be used to identify human peptides in human melanomas.

      We thank the reviewer for pointing out that this pipeline could be used to identify human peptides in human melanomas. Indeed, the work described here is a proof of concept meant to be translated to the human setting. To this end, we have two ongoing projects in the lab that exploit the same pipeline to investigate the human epithelioid and human mesothelioma ligandome landscapes. For the latter, we are investigating four different human cell lines (H2B, MSTO211H, H2452 and JL1). As shown in the picture below (Figure 2), the peptide length distribution showed an enrichment of 9-mers in both replicates (Rep1 and Rep2), in line with a ligandome profile. The analysis of the binders revealed that most of them were good binders (according to the EL-Rank score) for at least one of the alleles of each cell line. Following the pipeline reported in this manuscript, we applied two different approaches to select candidate peptides. The first approach relied on RNA-seq analysis to check which source proteins of the peptides isolated in the ligandome analysis were reported as upregulated or downregulated in resected tumors compared with normal tissues (from GSE51024, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE51024). The second approach (analysis still ongoing) will use the HEX software.

      Regarding the epithelioid project, we have analysed the ligandome profiles of two human cell lines: NEPS and VA-ES-BJ. An example of the ligandome analysis for the VA-ES-BJ cell line is shown below (Figure 3). Overall, the analysis outcome (peptide length distribution, Gibbs clustering profile, number of binders) was similar to published datasets, confirming the good quality of the identified ligandome landscape. Next, we applied HEX analysis to narrow down the list of candidate peptides for further testing. We are currently collecting additional epithelioid cell lines to expand our cohort of samples.

      Figure 2 Mesothelioma project (data not published)

      Figure 3 Epithelioid project (data not published)

    1. Author Response:

      Reviewer #2:

      The SNX-BAR family of sorting nexin proteins is involved in the formation of tubular carriers at endosomes. The best characterized yeast sorting nexins form part of the retromer complex, which binds sorting signals on cargo proteins to direct their recycling. There is some debate as to the role of sorting nexins in mediating cargo recognition vs tubule formation, and it is unclear which (if any) other members of the sorting nexin family bind directly to cargo.

      In this manuscript, the authors investigate the function of the yeast sorting nexin Mvp1. This protein was previously proposed to cooperate with retromer in the formation of recycling tubules, and to recruit the dynamin-like protein Vps1 to promote their scission (Chi et al, JCB 2014). Here, Suzuki et al find that Mvp1 has a cargo-sorting role that is distinct from that of other sorting nexins. They show that Mvp1 (but not retromer) is required for the correct localization of the membrane protein Vps55, and identify a cytosolically-exposed sequence in Vps55 required for its sorting. Using structurally-guided mutagenesis, they find that dimerization and membrane binding are important for Mvp1 function. They use live cell imaging to show that Vps55 is largely sorted into different tubules compared to the retromer cargo protein Vps10, and use fractionation of vesicle fusion-deficient cells to show these cargo are present in different vesicle populations, suggesting that Mvp1 and retromer form different classes of retrograde carriers. By surveying the trafficking of other membrane proteins, they show that in some cases Mvp1 acts redundantly with two other sorting nexin complexes (Snx4 and/or retromer) to recycle cargo at endosomes. Moreover, they find that loss of all three sorting nexin complexes perturbs endosome function, lipid asymmetry, and the endosomal recruitment of the scission factor Vps1. Although Mvp1 was previously implicated in Vps1 recruitment (Chi et al, 2014), Suzuki et al use a GTPase-defective form of Vps1 to provide the first evidence that Mvp1 physically interacts with Vps1 in vivo and in vitro. Taken together, these data suggest that Mvp1, retromer and Snx4 recognize distinct sets of cargo proteins and mediate independent recycling pathways at endosomes, and imply that each sorting nexin recruits Vps1 to complete tubule scission.

      Overall, this manuscript presents a large number of experiments that are technically well executed and makes several novel observations. It should be noted that many experiments largely repeat previous work: this was not always clearly indicated in the manuscript. For the most novel observations, some weaknesses were noted. A key novel finding was that Mvp1 binds to and sorts the cargo protein Vps55 via recognition of a cytosolic motif. The supporting data do not provide the typical burden of proof for such experiments, because: (1) the identified sequence was shown to be necessary but not sufficient, thus the mutation could indirectly affect binding at another site, and (2) Mvp1 failed to coIP with the Vps55 mutant from cell lysates, but this could be an indirect effect of Vps55 missorting to the vacuole while Mvp1 remains at the endosome, and does not prove that Mvp1 binds directly to Vps55 via this motif.

      Thank you for pointing this out. As mentioned above, to address your point, we examined the Mvp1-Vps55 interaction in cells lacking Vam3, which is required for endosome fusion with the vacuole. In this mutant, both WT Vps55 and the recycling mutants localize to the endosome (Fig. Rev. 1C). We confirmed that mutations in the recycling sequence altered the Mvp1-Vps55 interaction even in vam3Δ cells (Figure 3-figure supplement 1C was added to the revised manuscript). To address whether the recycling signal is sufficient for Mvp1-mediated recycling, we tried several chimeric constructs, but none of them was recycled in an Mvp1-dependent manner. Hence, we were not able to address this point.

      A second key finding is that Mvp1 and retromer form distinct classes of tubular carriers at endosomes. While the manuscript does provide data to support this conclusion, I was disappointed that there was no discussion of the work of Chi et al, who showed through careful quantitative analysis that Mvp1 and retromer frequently label the same population of tubules.

      Thank you for pointing this out. In the revised manuscript, we have also discussed the differences with Chi et al. in the text (Page 13, line 408).

      Moreover, the authors claim that mvp1 mutants secrete little CPY, yet the literature indicates these mutants secrete ~65% of newly synthesized CPY (Ekena and Stevens, MCB 1995), suggesting a functional link between Mvp1 and Vps10 recycling. In fact, vps55 mutants themselves have a significant CPY missorting defect (~50% secreted) suggesting that some mvp1 phenotypes could be a secondary consequence of Vps55 mislocalization.

      Thank you for pointing this out. We examined the CPY sorting in the recycling signal mutants. Strikingly, CPY was partially missorted to the extracellular space in vps55Y61A/T63A/F66A/M67A mutants (Fig. Rev. 6). Since Vps10 recycling was not altered in mvp1Δ cells (Figure 5A), we believe that the mislocalization of Vps55 causes the CPY sorting defect in mvp1Δ cells.

      It was not mentioned that Vps55 interacts with the transmembrane protein Vps68: these proteins are interdependent for their stability and loss of Vps68 slows traffic out of the endosome (Schluter et al MBOC 2008). This provides a simple explanation for the observed ubiquitination and degradation of overexpressed Vps55, which presumably saturates available Vps68.

      As suggested by the reviewer, we have revised the manuscript (Page 5, line 158). Also, as mentioned above, we observed that Vps55 missorting was suppressed by overexpression of Vps68 (Figure 3-figure supplement 1E was added to the revised manuscript), suggesting that Vps68 was saturated in this condition.

      Other experiments in this manuscript were not completely novel, including: the demonstration that Mvp1 tubules bud from endosomes and that Mvp1 is important for Vps1 recruitment to endosomes (Chi et al, JCB 2014); that Vps1 GTPase mutants accumulate Mvp1 at endosomes (Ekena and Stevens, MCB 1995); that Mvp1 plays a role in Vps55 localization (Bean et al, Traffic 2017); and that GFP-SNX8 is present on endosomal tubules when expressed in mammalian cells (van Weering et al, Traffic 2012). While in most cases the experiments presented in this manuscript build on and extend previous work, I would like to see the earlier work fully acknowledged, and any discrepancies appropriately discussed. The fact that many of the experiments presented in this manuscript are not entirely novel detracts from the overall impact of the work. Despite this, key original findings presented in this paper - including the discovery that Mvp1 is required for sorting specific cargo and binds directly to the dynamin-like protein Vps1 - will be of broad interest to the trafficking field.

      Thank you for pointing this out. We have carefully revised the manuscript accordingly (Page 5, line 133; Page 8, line 236; Page 13, line 414; Page 12, line 377).

    1. Author Response:

      Reviewer #1 (Public Review):

      This study demonstrates, with analytical methods and simulations, a new approach to estimate pairwise noise and signal correlations in two-photon calcium imaging data. This approach compensates for biases introduced by the dynamics of calcium signals, without deconvolution and for low trial numbers. Simulations based on idealized calcium signals demonstrate the efficiency of the method, and application to auditory cortex imaging data leads to mild changes in the results shown in the past based on less accurate estimates. This study has the merit of identifying biases that can arise when evaluating noise and signal correlations across neurons with indirect signals. Moreover, the solution provided may become a useful addition to the neuroscientist's signal analysis toolbox. Noise and signal correlations are related to functional connectivity between neurons, and thereby give insights into the functional structure of the underlying network. They do not necessarily account for the full complexity of neural interactions but are used in numerous studies, which would be improved by this tool. A potential improvement of the study could be to indicate how this approach could be generalized to other neuron-to-neuron interaction measurements or to data-driven neural network modeling.

      We would like to sincerely thank Reviewer 1 for his supportive stance towards our work, and for providing helpful feedback to improve our manuscript.

      The main weakness of the study is that the efficiency of the method is only assessed with simulated datasets. Finding real ground-truth data for a validation beyond that would be difficult if not impossible. However, the authors could further convince the reader by showing the effect of relaxing certain assumptions of their surrogate data generation model (e.g., absence of temporal correlation in measurement noise), and by showing the robustness and limits of the method.

      Thank you for this suggestion. Motivated by this comment, and a related comment by Reviewer 2, we have now substantially enhanced our performance analyses in the revised manuscript and compiled them in a new subsection titled “Analysis of Robustness with respect to Modeling Assumptions” for better clarity and consistency. In summary:

      1) We first examined the robustness of our proposed method with respect to model mismatch in the stimulus integration model. As suggested, we generated data according to a non-linear (i.e., quadratic sum of linear filters) receptive field model:

      but assumed a linear stimulus integration model in our inference procedure

      The comparison of the correlations estimated by each method under this setting is shown in Figure 2 – Figure Supplement 3. While the performance of our proposed signal correlation estimates under this setting degrades compared to that in Figure 2 with no model mismatch, our proposed estimates still outperform the other methods and recover the ground truth signal correlation structure reasonably well.

      It is noteworthy that the model mismatch in the stimulus integration component does not affect the accuracy of noise correlation estimates in our method, as is evident from the noise correlation estimates in Figure 2 – Figure Supplement 3. In comparison, the biases induced in the other methods by model mismatch and various other factors, such as observation noise, temporal blurring, and neglecting the non-linear mappings between spikes and underlying covariates, result in significantly larger errors in both signal and noise correlation estimates.

      2) We incorporated our previous analysis of robustness with respect to calcium decay model mismatch in this subsection, which is shown in Figure 2 – Figure Supplement 4.

      3) In response to a related comment by Reviewer 2, we then performed extensive simulations to evaluate the effects of SNR and firing rate on the performance of our method. Overall, while the performance of all algorithms degrades at low SNR or firing rate values (SNR < 10 dB, firing rate < 0.5 Hz), our algorithm outperforms the existing methods in a wide range of SNR and firing rate values considered. The results are summarized in Figure 2 – Figure Supplement 5.

      4) Finally, we considered two observation noise model mismatch conditions, namely, white noise + low frequency drift and pink noise, similar to the treatment in Deneux et al. (2016). For each noise mismatch model, we also varied the SNR level and firing rate and compared the performance of the different algorithms as reported in Figure 2 – Figure Supplement 6. These new analyses demonstrate that our proposed estimates outperform the existing methods, under correlated generative noise models, and also with respect to varying levels of SNR and firing rate. As clearly evident in panels C and F of Figure 2 – Figure Supplement 6, even though the estimated calcium concentrations are contaminated by the temporally correlated fluctuations in observation noise, the putative spikes estimated as a byproduct of our iterative method closely match the ground truth spikes, which in turn results in accurate estimates of signal and noise correlations.
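      The SNR sweep in item 3 and the two noise-mismatch conditions in item 4 can be illustrated with a minimal sketch. It assumes SNR is defined as 10·log10 of mean signal power over mean noise power, that the drift is a slow sinusoid, and that the clean signal is a toy AR(1)-filtered spike train with decay 0.95; all function names, the drift shape and these parameter values are our assumptions, not the generative settings of Deneux et al. (2016) or of the revised manuscript.

```python
import numpy as np

def _unit_power(x):
    """Rescale x so that its mean square equals 1."""
    return x / np.sqrt(np.mean(x ** 2))

def pink_noise(n, rng):
    """1/f noise: shape the spectrum of white Gaussian noise by 1/sqrt(f)."""
    spectrum = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n)
    gain = np.zeros_like(freqs)
    gain[1:] = 1.0 / np.sqrt(freqs[1:])  # power ~ 1/f; DC component removed
    return _unit_power(np.fft.irfft(spectrum * gain, n))

def white_plus_drift(n, rng, drift_period=1000.0, drift_amp=1.0):
    """White Gaussian noise plus a slow sinusoidal baseline drift (assumed drift shape)."""
    t = np.arange(n)
    drift = drift_amp * np.sin(2 * np.pi * t / drift_period)
    return _unit_power(rng.standard_normal(n) + drift)

def add_noise_at_snr(signal, unit_noise, snr_db):
    """Scale unit-power noise so that 10*log10(P_signal / P_noise) == snr_db."""
    p_noise = np.mean(signal ** 2) / 10 ** (snr_db / 10)
    return signal + np.sqrt(p_noise) * unit_noise

def realized_snr_db(signal, noisy):
    """Measured SNR in dB of a noisy trace against its clean version."""
    return 10 * np.log10(np.mean(signal ** 2) / np.mean((noisy - signal) ** 2))

# Toy calcium trace: sparse spikes filtered by an AR(1) decay (assumed decay 0.95).
rng = np.random.default_rng(0)
n = 4096
spikes = (rng.random(n) < 0.02).astype(float)
calcium = np.zeros(n)
level = 0.0
for t in range(n):
    level = 0.95 * level + spikes[t]
    calcium[t] = level

pink = pink_noise(n, rng)
noisy_pink = add_noise_at_snr(calcium, pink, snr_db=10.0)
noisy_drift = add_noise_at_snr(calcium, white_plus_drift(n, rng), snr_db=10.0)
snr_pink = realized_snr_db(calcium, noisy_pink)
snr_drift = realized_snr_db(calcium, noisy_drift)
print(round(snr_pink, 3), round(snr_drift, 3))
```

      Because each noise trace is normalized to unit power before scaling, the realized SNR matches the requested value for both mismatch models, so the same helper can be looped over a grid of SNR levels and firing rates.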

      To address this comment, we performed extensive simulations to evaluate the robustness of different algorithms under model mismatch conditions induced by 1) non-linearity in the stimulus integration model, 2) calcium decay, 3) SNR and firing rate, and 4) temporal correlation of observation noise. We have now compiled these results in a new subsection called “Analysis of Robustness with respect to Modeling Assumptions” (Pages 6-7).

      Also, further intuition about why this method outperforms others would be of great help for non-specialist readers.

      Thank you for this suggestion. There are two sources for the performance gap between our proposed method and existing approaches:

      1) Favorable soft decisions on the timing of spikes achieved by our method, as a byproduct of the iterative variational inference procedure: an accurate probabilistic decoding of spikes results in better estimates of the signal/noise correlations, and conversely having more accurate estimates of the signal/noise covariances improves the probabilistic characterization of spiking events. This is in contrast with both the Pearson and Two-Stage methods: in the Pearson method, spike timing is heavily blurred by the calcium decay; in the two-stage methods, erroneous hard (i.e., binary) decisions on the timing of spiking events result in biases that propagate to and contaminate the downstream signal and noise correlation estimation and thus result in significant errors.

      2) Explicit modeling of the non-linear mapping from stimulus and latent noise covariates to spiking through a canonical point process model (which is in turn tied to a two-photon observation model in a multi-tier Bayesian fashion) results in robust performance with a limited number of trials and a short observation duration. As we have shown in Appendix 1, as the number of trials L and the trial duration T tend to infinity, the conventional notions of signal and noise correlation indeed recover the ground truth signal and noise correlations, as the biases induced by non-linearities average out across trial repetitions. However, as shown in Figure 2 - Figure supplement 2, the conventional correlation estimates require ~1000 trials to achieve performance comparable to that of our method using 20 trials.
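      To build intuition for why conventional estimates need many trials, the following minimal sketch computes them on toy data: signal correlation as the Pearson correlation of trial-averaged tuning curves, and noise correlation as the Pearson correlation of trial-to-trial residuals. It is our own illustration with Gaussian responses rather than spikes or calcium traces, and it is not the proposed method. The two simulated neurons have exactly uncorrelated tuning (sine vs cosine) but share noise with correlation 0.8; the parameters `n_stim`, `rho` and `noise_sd` are hypothetical.

```python
import numpy as np

def conventional_correlations(r1, r2):
    """Conventional estimates from responses shaped (trials, stimuli).

    Signal correlation: Pearson correlation of the trial-averaged tuning curves.
    Noise correlation: Pearson correlation of the trial-to-trial residuals.
    """
    m1, m2 = r1.mean(axis=0), r2.mean(axis=0)
    signal_corr = np.corrcoef(m1, m2)[0, 1]
    noise_corr = np.corrcoef((r1 - m1).ravel(), (r2 - m2).ravel())[0, 1]
    return signal_corr, noise_corr

def simulate(n_trials, n_stim=400, rho=0.8, noise_sd=2.0, seed=0):
    """Two neurons with exactly uncorrelated tuning (sin vs cos) and shared Gaussian
    noise of correlation rho; all parameter values are hypothetical."""
    rng = np.random.default_rng(seed)
    phase = 2 * np.pi * np.arange(n_stim) / n_stim
    cov = noise_sd ** 2 * np.array([[1.0, rho], [rho, 1.0]])
    noise = rng.multivariate_normal(np.zeros(2), cov, size=(n_trials, n_stim))
    return np.sin(phase) + noise[..., 0], np.cos(phase) + noise[..., 1]

# True signal correlation is 0 and true noise correlation is 0.8.
results = {L: conventional_correlations(*simulate(L)) for L in (2, 20, 1000)}
for L, (sc, nc) in results.items():
    print(f"L={L}: signal corr {sc:.2f}, noise corr {nc:.2f}")
```

      Trial averaging leaves residual noise of variance σ²/L that is still correlated across the two neurons, so the conventional signal correlation is approximately ρ·(σ²/L)/(1/2 + σ²/L) in expectation, about 0.64 for L = 2 with these settings, and the leakage fades only as L grows.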

      To address this comment, we have now included the aforementioned items in the revised Discussion section, highlighting the key aspects of our method that make it outperform existing approaches (Pages 17-18).

      Reviewer #2 (Public Review):

      This manuscript describes a new method for estimating signal and noise correlations from two-photon recordings of calcium activity in large neuronal networks. Unlike existing methods that first require inferring spikes from calcium transients before estimating the correlations, the proposed method performs the correlation estimation directly from the fluorescence traces. It treats the different inputs to each neuron as latent variables to be inferred from its observed fluorescence activity, and divides these inputs according to whether they are provided by stimulus-dependent (signal) or stimulus-independent (noise) inputs. The authors showed with simulations that proper definitions of signal and noise correlations based on these inferred variables converge with trial repetition much faster to the true correlations than conventional estimates. They are not sensitive to blurring produced by inaccurate spike deconvolution and are less prone to erroneously mixing the signal and noise components of the correlations. By applying this new method to real optical recordings from the auditory cortex of awake mice, the authors shed new light on the structure of the circuitry underlying the processing of sound information in this brain region. Circuits processing sound-related and sound-independent information appear to be more orthogonal than previously thought, with a spatial signature that changes between thalamorecipient layer 4 and supragranular layers 2/3.

      This is a mathematical manuscript that introduces a promising new analysis approach. It is designed to be applied to two-photon experiments, which typically produce recordings of the calcium activity of several hundred neurons simultaneously. Because of their massively parallel recordings, which do not rely on spike sorting to identify single units, these optical techniques naturally provide access to correlations between units. They have given rise to a field of active research that attempts to link these correlations to elementary functional circuits in the brain. However, as the authors point out, the low efficiency of spike inference from calcium traces raises the need for correlation estimation approaches that circumvent this problem, as the method presented here does. As such, it could have a significant impact if the community succeeds in using it (see below).

      We would like to sincerely thank Reviewer 2 for his/her supportive stance towards our work, and for providing helpful feedback to improve our manuscript.

      Weaknesses and strengths

      1) Public availability of the code implementing the new method is clearly necessary for the two-photon microscopy community to adopt it, and this is indeed the case at https://github.com/Anuththara-Rupasinghe/Signal-Noise-Correlation. However, it is also crucial that any end-user be able to get a clear picture of the conditions under which the method can or cannot be applied before diving in. The fact that such an applicability domain is not well defined is a major concern. Notably, each Real Data Study presented in the paper uses a preliminary selection of "highly active cells" (1st study: N = 16; 2nd study: N = 10; 3rd study: N ~ 20 per field), as the authors succinctly discuss that performance is expected to degrade "in the regime of extremely low spiking rate and high observation noise" (l. 518-519). But no precise criteria are provided to specify what is meant by "highly active cells". On the other hand, the authors also assume that there is at most one spiking event per time frame for each neuron, which seems to exclude bursting neurons. The latter assumption seems to be a challenge with respect to the example traces shown in Fig. 4C (ΔF/F reaches 400%) and in Fig. 6C (ΔF/F reaches 100%), considering that the GCaMP6s signal for a single spike is expected to peak below 10-20%. This forces the authors to take a scaling factor of the observations A = 1 × I (Real Data Studies 1 and 3) or A = 0.75 × I (Real Data Study 2) compared to the A = 0.1 × I taken in the Simulation Studies. Therefore, it looks as if the Real Data Studies were performed on mainly bursting cells and each burst was counted as one spiking event. A detailed discussion of the usable range of firing rates, whether in spike or burst units, as well as the usable range of SNR should be added to the main text to allow future users to assess the suitability of their data for this analysis.

      Thank you for pointing out the issues related to the applicability domain of our method. We agree that clarifying the rationale behind our model parameter choices is key to facilitating its usage by future users. In response to this comment, we have made three major revisions:

      1) Adding a new subsection to the Methods and Materials called “Guidelines for model parameter settings” that includes our rationale and criteria for choosing the number of neurons (N), stimulus integration window length (R), observation noise covariance (Σ_w), scaling matrix A, state transition parameter (α), and mean of the latent noise process (μ_x);

      2) Inspecting the capability of our proposed method in compensating for rapid increase of firing rate;

      3) Performing extensive new simulations to evaluate the effect of SNR level and firing rate on the performance of our proposed method, included in a new subsection in the Results section called “Analysis of robustness with respect to modeling assumptions”.

      We will next describe these changes in a point-by-point fashion.

      -Criterion for selecting the number of neurons. While our proposed method scales up well with the population size due to the low-complexity update rules involved, including neurons with negligible spiking activity in the analysis would only increase the complexity and potentially contaminate the correlation estimates. Thus, we performed an initial pre-processing step to extract N neurons that exhibited at least one spiking event in at least half of the trials considered. This criterion is now clearly stated in the subsection “Guidelines for model parameter settings”. We have also reworded “highly active cells” to “responsive cells (according to the selection criterion described in Methods and Materials)” for clarity.

      -Evaluating the effects of SNR level and firing rate. We had previously noted that the performance degrades at low SNR and firing rate values, with little quantitative justification. In response to this comment, and a related comment by Reviewer 1, we performed extensive simulations to evaluate the robustness of the different methods under varying SNR levels, firing rates, and observation noise model mismatch (including white noise + drift and pink noise models). These results are included in a new subsection called “Analysis of robustness with respect to modeling assumptions” and shown in Figure 2 – Figure Supplement 5 and 6.

      While the performance of all methods (including ours) degrades at low SNR levels or firing rates (SNR < 10 dB, firing rate < 0.5 Hz), our proposed method outperforms the existing methods in a wide range of SNR and firing rate values and under the considered observation noise model mismatch conditions. To quantify this comparison, we have also indicated the mean and standard deviation of the relative performance gain of our proposed estimates across SNR levels and firing rates as insets in Figure 2 – Figure Supplement 5 and 6.

      -Choosing the scaling matrix A. In each case, we set A=aI, and estimated a by considering the average increase in fluorescence after the occurrence of isolated spiking events. Specifically, we derived the average fluorescence activity of multiple trials triggered to the spiking onset and set a as the increment in the magnitude of this average fluorescence immediately following the spiking event.

      -Compensation for rapid increase of firing rate. The comment of the reviewer regarding the sudden increase of ∆F/F in Fig. 4C prompted us to inspect the performance of the algorithm in such scenarios where the choice of A may underestimate the rapid increase of firing rate (e.g., A= I). In the new supplementary figure to Fig. 4, called Figure 4 – Figure Supplement 2, we show a zoomed-in view of the time-domain estimates of the latent processes obtained by our proposed method (replicated here for discussion):

      Notably, the fluorescence activity rises up to a magnitude of ∼ 14, while we have set a=1. Thus, as the reviewer pointed out, this activity is induced by a burst-like event due to successive closely-spaced spikes. Due to the low firing rate of A1 neurons, we believe this is not a bursting event (in the electrophysiological sense), but a rapid increase in firing rate that may result in the occurrence of more than one spike per frame. From the estimates of the latent calcium concentration (purple) and putative spikes (green), we clearly see that our proposed method is still capable of matching the observed fluorescence activity through two mitigatory mechanisms that we describe next:

      1) The proposed method predicts spiking events in adjacent time frames to compensate for rapid increase of firing rate (see the green trace following the vertical dashed line) and thus infers calcium concentration levels that match the observed fluorescence activity;

      2) Even though our generative model assumes that there is only one spiking event in a given time frame, this assumption is implicitly alleviated in our inference framework by relaxing the constraint

      as explained in the section Methods and Materials - Low-complexity parameter updates (Page 23). While this relaxation was performed in order to make the inverse problem tractable, we see that it in fact leads to improved estimation results under such settings, by allowing the putative spike magnitudes

      to be greater than 1, as it is also evident in the magnitude of the inferred spikes right after the rise of fluorescence activity (the horizontal dashed line corresponds to spiking magnitude equal to 1).

      We have now discussed this observation in the Results section (Page 10).
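      Why relaxing the one-spike-per-frame constraint helps can be shown with a minimal sketch. It is not the variational inference used in the manuscript: it assumes noise-free AR(1) calcium dynamics c_t = γ c_{t-1} + n_t with a hypothetical decay γ = 0.9, fluorescence y_t = a c_t with a = 1, and that the relaxation replaces n_t ∈ {0, 1} by n_t ≥ 0. Inverting the model frame by frame, the binary projection cannot reproduce a jump of 14 within one frame, whereas the non-negative version assigns a spike magnitude of about 14 to that frame.

```python
import numpy as np

def calcium_to_fluorescence(spikes, gamma=0.9, a=1.0):
    """Noise-free AR(1) model (assumed): c_t = gamma * c_{t-1} + n_t, y_t = a * c_t."""
    y = np.zeros(len(spikes))
    c = 0.0
    for t, n in enumerate(spikes):
        c = gamma * c + n
        y[t] = a * c
    return y

def invert_spikes(y, gamma=0.9, a=1.0, binary=False):
    """Recover n_t = c_t - gamma * c_{t-1}, then project onto {0, 1} or onto n_t >= 0."""
    c = y / a
    n_hat = c - gamma * np.concatenate(([0.0], c[:-1]))
    if binary:
        return (n_hat >= 0.5).astype(float)   # nearest point in {0, 1}
    return np.clip(n_hat, 0.0, None)          # relaxed: any non-negative magnitude

spikes = np.zeros(20)
spikes[5] = 1.0
spikes[12] = 14.0   # burst-like jump packed into a single frame
y = calcium_to_fluorescence(spikes)
relaxed = invert_spikes(y)
binary = invert_spikes(y, binary=True)
err_relaxed = np.max(np.abs(calcium_to_fluorescence(relaxed) - y))
err_binary = np.max(np.abs(calcium_to_fluorescence(binary) - y))
print(round(relaxed[12], 6), round(err_relaxed, 6), round(err_binary, 3))
```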

      To address this comment, we have added a new subsection to Methods called “Guidelines for model parameter settings” that includes our rationale and criteria for choosing key model parameters (Page 24), have performed new simulation studies to evaluate the effects of SNR and firing rate on the performance of the proposed method (Pages 6-7), and closely inspected the performance of our method under rapid increase of firing rate (Page 10).
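      Two of the guidelines above, the responsive-cell criterion (at least one spiking event in at least half of the trials) and setting a from the jump of the spike-triggered average fluorescence after isolated spikes, can be sketched as follows. This is a minimal sketch, assuming putative spikes are stored as an array of shape (neurons, trials, frames) and that the spike frames passed to the scale estimate are isolated; the function names, the 5-frame baseline window and the toy values are hypothetical.

```python
import numpy as np

def select_responsive(spikes, min_trial_fraction=0.5):
    """spikes: array (neurons, trials, frames) of putative spike counts.
    Return indices of neurons with >= 1 spike in at least min_trial_fraction of trials."""
    active = (spikes > 0).any(axis=2)            # (neurons, trials): any spike in the trial?
    return np.flatnonzero(active.mean(axis=1) >= min_trial_fraction)

def estimate_scale(fluor, spike_frames, baseline=5):
    """Estimate a as the jump of the spike-triggered average fluorescence:
    value at the spike frame minus the mean of the preceding `baseline` frames."""
    windows = np.array([fluor[t - baseline:t + 1] for t in spike_frames])
    sta = windows.mean(axis=0)                   # spike-triggered average, length baseline + 1
    return sta[-1] - sta[:-1].mean()

# Toy data (made up): 3 neurons, 4 trials, 10 frames.
spikes = np.zeros((3, 4, 10))
spikes[0, [0, 1, 2], 3] = 1   # neuron 0 fires in 3 of 4 trials
spikes[1, 0, 5] = 1           # neuron 1 fires in 1 of 4 trials
spikes[2, [1, 3], 7] = 1      # neuron 2 fires in 2 of 4 trials
keep = select_responsive(spikes)

# Toy fluorescence: isolated spikes, AR(1) decay 0.8, true a = 0.75 (all assumed values).
frames = [20, 60, 100]
trace = np.zeros(150)
level = 0.0
for t in range(150):
    level = 0.8 * level + (1.0 if t in frames else 0.0)
    trace[t] = 0.75 * level
a_hat = estimate_scale(trace, frames)
print(keep, round(a_hat, 3))
```

      Subtracting the pre-spike baseline keeps residual calcium from earlier spikes from inflating the estimate, which is why the criterion relies on isolated events.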

      2) Another parameter seems to be set by the authors on a criterion that is unclear to me: the number of time lags R to be included in the sound stimulus vector s_t. It seems to act as a memory of the past trajectory of the stimulus and probably serves to enhance the effect of stimulus onset/offset relative to the rest of the sound presentation. It is consistent with the known tendency of neurons in the primary auditory cortex to respond to these abrupt changes in sound power. However, R is set to 2 in Simulation Study 1, whereas it is set to 25 in Real Data Studies 1 and 3, and to 40 in Real Data Study 2. What leads to these differences escaped me and should be explained more clearly.

      Thank you for pointing out this lack of clarity in explaining the rationale behind choosing R. In addressing this comment, we have now added an entry in the new subsection “Guidelines for model parameter settings”. Furthermore, we have unified our choice of R in the three real data studies. We will explain these changes in a point-by-point fashion next.

      -Choice of R in simulation studies. The stimulus used in the simulation was a 6th-order autoregressive process whose present and immediate past values contributed to spiking in our generative model (i.e., R=2). Given that the ground truth value of R was known in the simulations, we used R=2 for inference as well.

      -Choice of R for real data application. The number of lags R considered in stimulus integration is a key parameter that can be set through data-driven approaches or using prior domain knowledge. Examples of common data-driven criteria include cross-validation, Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC), which balance the estimation accuracy and model complexity.

      To quantify the effect of R on model complexity, we first describe the stimulus encoding model in our framework. Suppose that the onset of the p-th tone in the stimulus set (p = 1, ⋯, P, where P is the number of distinct tones) is given by a binary sequence

      The choice of R implies that the response at time t post-stimulus depends only on the R most recent time lags. As such, the effective stimulus at time t corresponding to tone p is given by

      By including all the P tones, the overall effective stimulus at the t-th time frame is given by

      The stimulus modulation vector d_j would thus be RP-dimensional. As a result, the number of parameters to be estimated (M = RP) increases linearly with R. Using additional domain knowledge, we chose R to be large enough to capture the stimulus effects, and at the same time small enough to control the complexity of the algorithm.
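      A minimal sketch of this construction follows, assuming each tone enters only through its binary onset sequence, that onsets before the start of the recording are zero, and that the R lags of one tone are stored contiguously (the ordering inside the RP-dimensional vector is our choice). The function `lagged_stimulus`, the toy sizes and the random `d_j` are hypothetical; the sketch only makes the linear growth M = RP explicit.

```python
import numpy as np

def lagged_stimulus(onsets, R):
    """Build the effective stimulus matrix from binary tone onsets.

    onsets: array (P, T); onsets[p, t] = 1 if tone p starts at frame t.
    Returns S with shape (T, R * P): row t holds, for each tone p, the R most
    recent onset values [u_t, u_{t-1}, ..., u_{t-R+1}] (zeros before frame 0).
    """
    P, T = onsets.shape
    S = np.zeros((T, R * P))
    for p in range(P):
        for lag in range(min(R, T)):       # lags >= T never reach a valid frame
            S[lag:, p * R + lag] = onsets[p, :T - lag]
    return S

P, T, R = 3, 10, 4
onsets = np.zeros((P, T))
onsets[1, 2] = 1   # tone 1 starts at frame 2
onsets[0, 8] = 1   # tone 0 starts at frame 8 (near the end of the recording)
S = lagged_stimulus(onsets, R)
M = S.shape[1]     # number of stimulus-modulation parameters per neuron, M = R * P
d_j = np.random.default_rng(0).normal(size=M)   # hypothetical modulation vector
drive = S @ d_j    # stimulus contribution to neuron j at each frame
print(S.shape, M)
```

      Doubling R doubles the width of S and hence the number of entries of d_j to estimate, which is the complexity cost weighed against capturing the full response.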

      As an example, given that the typical response duration of neurons in mouse primary auditory cortex is < 1 s, with a sampling frequency of f_s = 30 Hz, we surmised that a choice of R ∼ 30 would suffice to capture the stimulus effects. We further examined the effect of varying R on the proposed correlation estimates in Figure 4 – Figure Supplement 1. As shown, small values of R (e.g., R = 1 or 10) may not be adequate to fully capture the effects of stimuli. For values of R in the range 25 to 50, we noticed that the correlation estimates remain stable. We thus chose R = 25 for our real data analyses. Notably, the results of real data study 2 (which previously used R = 40) are nearly unchanged with the new choice of R = 25, in accordance with our observation in Figure 4 – Figure Supplement 1.
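      The arithmetic behind this choice, writing T_resp for the assumed upper bound on the response duration (our notation):

```latex
R \approx T_{\mathrm{resp}} \, f_s = 1\,\mathrm{s} \times 30\,\mathrm{Hz} = 30 \ \text{frames},
\qquad
R = 25 \;\Rightarrow\; \frac{R}{f_s} = \frac{25}{30\,\mathrm{Hz}} \approx 0.83\,\mathrm{s}.
```

      So R = 25 covers roughly the first 0.83 s after a tone onset, and since M = RP, each additional lag adds P parameters to d_j.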

      To address this comment, we have added a new subsection to Methods called “Guidelines for model parameter settings” (Page 24) that includes our rationale for choosing the stimulus integration window length R and have performed a new analysis to evaluate the effect of R on the performance of the proposed method in real data study 1 (Page 10).

      3) This memory of the past stimulus trajectory appears to be specific to the proposed method and is not accounted for in the 2-stage Pearson estimation, for example. Since it probably helps to reflect the common sensitivity of neurons to onset/offset, it alone provides an advantage to the proposed method over the 2-stage Pearson estimation. It would be instructive to also perform this comparison with R set to 1 to get an idea of the magnitude of this advantage.

      We agree that explicit modeling of stimulus integration is a key advantage of our proposed method in comparison to the conventional ones. We have now explained this virtue in the discussion of the role of R in real data study 1 (Page 10). Additionally, as explained in our responses to the previous comment, we have included a new analysis of the sensitivity of our proposed estimates to the choice of R as a supplementary figure to Figure 4. As the reviewer suggested, we see that R=1 indeed fails to capture the underlying structure in the signal correlations. However, when R is sufficiently large (R>20), the estimates become stable.

      To address this comment, we have now discussed the advantage of including the stimulus history in our model and probed the sensitivity of our estimates to the choice of R in Figure 4 – Figure Supplement 1 (Page 10).

      4) Finally, although the example ground truth signal and noise correlation matrices used to illustrate the method in the simulation study in Fig. 2A were chosen to have almost no overlap in their non-zero coefficients, there is no fundamental reason why this separation should be the rule for real data. These coefficients reflect the patterns of stimulus-dependent and stimulus-independent functional connectivity in the recorded network. As such, these patterns could have different degrees of overlap, depending on the brain areas recorded. It is therefore particularly striking that the authors find in their data a strong dissimilarity and almost no covariance between signal and noise correlation coefficients, throughout all the different sets of experiments they present here (Fig. 4E, Table 1, 2, 3, and Fig. 6A&B). This makes a strong and compelling statement on the likely separation of the corresponding circuits in the primary auditory cortex of the mouse.

      We agree with the assessment of the reviewer. We suspect that some of the reported similarities between signal and noise correlations in the existing literature could be due to leakage in estimating these two quantities, likely induced by a limited number of trials, short observation durations, and neglecting the effects of calcium dynamics and non-linearities.
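      One simple way to quantify how similar two estimated correlation matrices are is to correlate their off-diagonal entries. The sketch below is an illustration only and is not necessarily the measure reported in Tables 1-3; the function name and the toy matrices are hypothetical.

```python
import numpy as np

def offdiag_similarity(signal_corr, noise_corr):
    """Pearson correlation between the i < j entries of two N x N correlation matrices.
    Near 0: unrelated pairwise patterns; +1 / -1: same / sign-reversed pattern (up to scale)."""
    iu = np.triu_indices_from(signal_corr, k=1)
    return np.corrcoef(signal_corr[iu], noise_corr[iu])[0, 1]

A = np.array([[1.0, 0.1, 0.2],
              [0.1, 1.0, 0.3],
              [0.2, 0.3, 1.0]])
B = 2 * np.eye(3) - A            # same unit diagonal, off-diagonal entries negated
print(offdiag_similarity(A, A), offdiag_similarity(A, B))
```

      Restricting to i < j avoids counting each pair twice and excludes the diagonal, which is 1 in every correlation matrix and would otherwise inflate the similarity.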

      Likely impact on the field

      It is now well established that sound processing is modulated, even at the level of primary auditory cortex, by locomotion (Schneider et al. Nature 2018), task engagement (Fritz et al. Nat. Neurosci. 2003), or several other factors. Applying the proposed method to these situations could help understand how sound processing circuits are remodeled, without confounding other coexisting processes. In general, whenever a brain structure makes associations between multiple processes within the same network, the presence of multiple circuits makes the observation of correlations difficult to attribute to the signature of a single circuit. By significantly improving the estimation of signal and noise correlations, the proposed method should help distinguish the boundaries of these circuits as well as their intersections. The exploration of the role of many secondary sensory and associative cortical structures could be renewed by this work.

      We would like to thank Reviewer 2 again for their supportive stance towards our work and for fairly summarizing our contributions.

    1. Author Response:

      Reviewer #1:

      The manuscript "Two different cell-cycle processes determine the timing of cell division in Escherichia coli" by Colin et al. presents an experimental approach to investigate the role of two governing cell-cycle processes, namely, DNA replication-segregation and the cell division cycle, in size regulation. The authors tackle the problem by first decoupling these two cell-cycle processes via sub-lethal dosages of A22, and then analyze the role of each process in the timing of cell division. Modern imaging and analysis techniques are used in this work to monitor cell division with single-cell resolution and chromosome replication with sub-cellular resolution. The large pool of data allows the authors to perform correlation analysis of cell size and the cell-cycle parameters, which led to the conclusion that the two processes have "balanced contributions in non-perturbed cells."

      The question studied in this manuscript is important and timely. The investigation of the two concurrent processes chosen by the authors is perhaps the right direction, which may eventually lead to a complete understanding of the E. coli cell cycle and size regulation. The high-resolution imaging and analysis accomplished in this work is also commendable. There is, however, a major concern about this manuscript, namely that the entire conclusion is based on the cell-cycle and size perturbations by A22. The caveat of the A22 perturbations is that an aberrant cell shape could affect both cellular processes simultaneously. Even though the C period and initiation size are largely unchanged, a possible, but unknown, cross-talk between the two processes may be affected by A22. Therefore, additional evidence is necessary to show whether the two processes independently determine cell division.

      We agree that A22 treatment could possibly affect DNA replication or organization, e.g., indirectly through an effect of cell width on DNA organization. It would thus indeed be desirable to confirm our findings based on alternative perturbations. At the same time, our experiments clearly demonstrate that cell sizes at replication initiation and division are decreasingly correlated with increasing A22 concentration, which suggests that a process different from DNA replication is responsible for the timing of division.

      Additionally, DNA replication could depend on cell division, which could complicate the relationship between replication and division. We have now addressed the possibility of an influence of division on replication initiation in the Discussion, where we write ‘The concurrent-cycles framework assumes that replication initiation is independent of cell division or cell size at birth, [...]. However, we note that this is not the only possibility, and DNA replication may not be entirely independent of cell division. A complementary hypothesis \citep{Kleckner2018} posits a possible (additional or complementary) connection of initiation to the preceding division event. To test this hypothesis one could perturb specific division processes by titrating components involved in Z-ring assembly (e.g., titrating FtsZ \citep{Zheng2016}).’

      Reviewer #2:

      This is an interesting paper which makes important contributions to an interesting and highly controversial topic: how an E. coli cell decides when to divide.

      As the authors describe in clear and careful detail, two main camps have argued (often dogmatically) for "single process" models in which division is either a direct, downstream consequence of replication initiation (which is the regulated step) or of effects that act directly on division (irrespective of replication and, more generally, the chromosome cycle). The authors of this paper have, instead, proposed that both types of effects are important, in different proportions according to the circumstances. They refer to this idea as a "concurrent cycles" hypothesis. In previous work they have presented arguments and data which they interpret as being incompatible with any single process model and consistent with their alternative hypothesis.

      This work now investigates the consequences of treatment with A22, a drug which inhibits MreB, with the result that it increases cell width and, concomitantly, increases the length of time between completion of a given round of DNA replication and the immediately ensuing cell division (an interval known as the "D period"). The idea to analyze this situation was motivated by the authors' previous hypothesis: according to the concurrent cycles idea, increasing the length of the D period should prolong the replication-independent inter-division process such that it becomes rate limiting in determining the timing of division (relative to the replication-dependent process).

      The data presented confirm the authors' expectation. They first show that progressively increasing the amount of A22 does not (dramatically) alter either: (i) the basic "adder" behavior in which a fixed amount of cell length is added irrespective of the length of the cell at birth or (ii) the finding that a fixed amount of cell length is added per replication origin during the period from one round of replication initiation to the next, which is consistent with (and generally considered to be supportive of) a role for a replication-dependent process.

      However, they also discover an interesting additional effect by examining the amount of cell length added (per origin) during the entire period comprising replication plus the immediately ensuing division ("C+D"). In the unperturbed case, cells that are longer at the time of initiation of replication also add more length during the ensuing (C+D) period. In contrast, in the presence of increasing amounts of A22, this effect is progressively reversed such that, finally, at high drug levels, cells which are longer (per origin) at the time of initiation of replication add much less length during the ensuing (C+D) period. Since the length of the C period is essentially constant in all conditions, the relevant effect is the variation in the length of the D period. And since the observed effect becomes more and more prominent with increasing A22 concentration, variation in the D period dominates more and more as the length of that period gets longer and longer. The authors interpret this effect to mean that, with increasing D-period length, division timing is decreasingly dependent on replication initiation. They go on to infer that "with increasing average D period, a process different from DNA replication is likely increasingly responsible for division control". This is a sensible, relatively formal restatement of the finding. This statement allows for diverse specific interpretations. The authors focus on one possible interpretation: they show that their previously proposed concurrent cycles hypothesis can quantitatively explain these data. In essence, given a replication-independent and a replication-dependent process, the observed findings are explained by an increased contribution of the replication-independent process. This scenario also does a better job of explaining the presented data, as well as other findings, than other recent "single process" models, for reasons that are discussed in straightforward detail in the Discussion. 
The authors also do an excellent job of laying out the assumptions upon which their model (and other existing models) are based, thus laying open the possibility for future studies to consider other possible scenarios.

      This work is important for four reasons. First, it provides interesting new data which must be accommodated by any synthetic explanation for cell division control. Second, it makes it abundantly clear that the validity of any proposed single-process model remains to be further substantiated. Third, it suggests an interesting alternative model which can accommodate a diversity of data, including that presented in the current work, and which has the potentially attractive feature of combining the two existing single-process models. Fourth, and perhaps most importantly, the authors' discussion of the available data in this field is clear, thoughtful and thought-provoking, and leaves open the possibility of some as-yet unimagined mechanism. Overall, this work provides an important counterpoint to other published work and is a very valuable contribution to thinking and discussion in this field.

      [It can also be noted specifically that this work provides an important counterpoint to the model proposed in a previous eLife paper on this topic by Witz et al., 2019 (eLife 2019;8:e48063 doi: 10.7554/eLife.48063).]

      We thank the reviewer for her careful assessment and appreciation of our work.

      Reviewer #3:

      Colin, Micali et al. investigated division and replication over successive cell cycles in slow-growing E. coli cells at the single-cell level, with perturbed cell dimensions. They found that the time between replication termination and division increased when cell width was perturbed, as recently reported, and that chromosome replication became decreasingly limiting for cell division. These results provide good support for the 'concurrent-processes model' previously proposed by some of the authors.

      1) Cell length can be used to represent cell size (adder) only if cell width remains constant. In the current form of the manuscript, it is unknown whether or not cell width varies significantly at the single-cell level with A22 treatment (e.g., 1 µg/ml A22). If it does, cell volume might not be well correlated with cell length, which would weaken the interpretation of Figure 3.

      We now demonstrate in the new Figure 2–S2 that the coefficient of variation of cell width does not increase with A22 concentration (neither in snapshots of cells grown in liquid culture nor in the mother machine):

      Figure: Variation of width at the single-cell level. Coefficient of variation of cell width as a function of mean cell width. Squares and triangles represent measurements on cells grown in the mother machine or in liquid culture, respectively. Blue represents wild-type cells; grey represents cells treated with different amounts of A22.

      We also reference this figure in the main text, writing: ‘Increasing A22 concentration leads to increasing steady-state cell width both in batch culture and in the mother machine (Figure \ref{fig2}B), without affecting cell-to-cell width fluctuations (Figure \ref{CV_width}), and without affecting doubling time (Figure \ref{fig2}C) or single-cell growth rate (Figure \ref{SI_Fig1}).’
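      The coefficient of variation in Figure 2–S2 is the standard deviation of single-cell widths divided by their mean. The sketch below uses the sample standard deviation (an assumption; the figure may use a different convention) and invented widths, chosen so that the mean width grows while the relative spread stays the same, which is the pattern the reply describes.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


# Hypothetical single-cell widths (µm) for an untreated and a treated condition.
widths_wt = [0.9, 1.0, 1.1]
widths_a22 = [1.35, 1.5, 1.65]
print(coefficient_of_variation(widths_wt), coefficient_of_variation(widths_a22))
```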

      2) The negative value of 𝜁C+D in Figure 3F (treated group) indicates that the division length is negatively correlated with the cell length at replication initiation. It is not obvious that this can rule out the possible contribution of DNA replication/segregation in offsetting the length difference at initiation and thus contribute to cell division. Since Figure 3F is the key observation to validate the model, more explanations are required to help readers understand how a negative 𝜁C+D can lead to a conclusion that a process different from DNA replication is likely responsible for division control with A22-treatment.

      The negative value of ζ_CD actually corresponds to a lack of correlation between division size and size at initiation, typically predicted by the models where replication is never limiting for cell division (Micali et al 2018, Si et al. 2019). We have commented more explicitly on this point in the text, writing: ‘Note that the negative value of $\zeta_{\rm CD}$ corresponds to a lack of correlation between division size and size at initiation (Figure \ref{fig3}G), typically predicted by the models where replication is never limiting for cell division~\cite{Micali2018,Micali2018b,Si2019}.’

      3) As an important input for the model, Q_C+D' is assumed to be equal to Q_C+D in unperturbed conditions and to remain constant regardless of A22 concentration (lines 548-554). This assumption is reasonable if the minimum time interval for segregation (D') is unaffected by changes in cell width. But how D' and Q_C+D' change with cell width is unknown. Earlier molecular studies revealed that the polymerization of MreB affects the activity of topoisomerase IV, an enzyme that mediates the dimerization of sister chromosomes, which implies that changing cell width may affect D'. Given the importance of Q_C+D' to the model, it is vital for the authors to make this assumption clear in the main text and to explain why it is reasonable.

      Q_CD’ (related to average growth in the CD’ period) is a parameter that we can neither measure nor bypass in the model. We have made this assumption more explicit in the text. While this question deserves further investigation in future studies, we know that D’ cannot increase too strongly with width, because otherwise replication/segregation would remain limiting for division under A22 perturbations, contrary to our observation. This is the main reason to assume D’ constant in the model. A posteriori, the loss of correlation between size at division and size at initiation observed under A22 treatment is in line with the hypothesis that D’ does not increase enough for the segregation process to interfere with cell division. We now write: ‘Note that neither the minimum completion time C+D' nor the coupling parameter $\zeta_{CD'}$ can be measured experimentally, or bypassed in the model. In principle these parameters could change under A22 perturbations, since MreB affects the activity of topoisomerase IV \citep{madabhushi2009actin,kruse2003dysfunctional}, an enzyme that mediates the dimerization of sister chromosomes. However, constancy of $\zeta_{CD'}$ is supported by the constancy of the C period, and the minimum D' period cannot increase too strongly with width in the model, because otherwise it would render replication/segregation limiting for division under A22 perturbations, contrary to our experimental observation. Hence, for simplicity, we assumed $\zeta_{CD'}$ and the D' period to stay constant.’

    1. Author Response

      Reviewer #1 (Public Review):

      The authors push a fresh perspective with a sufficiently sophisticated and novel methodology. I have some remaining reservations that concern the actual make-up of the data basis and consistency of results between the two (N=16) samples, the statistical analysis, as well as the “travelling” part.

      I previously commented on the fact that findings from both datasets were difficult to discern and more effort should be made to highlight these. Also, a major conclusion “the directionality effect [effect of attention on forward waves] only occurs for visual stimulation” only rested on a qualitative comparison between studies. The authors have improved on this here, e.g., by toning down this conclusion. One thing that is still missing is a graphical representation of the data from Foster et al. (the second dataset analysed here) that would support the statistical results and allow the reader a visual comparison between the sets of findings.

      We are glad that the reviewer recognizes the improvement in the presentation of the conclusions. Following the suggestions, we have modified Figure 2, not only by including a third dataset (see point below), but also in a way that allows a direct comparison between the three datasets. Specifically, the results from the three datasets are now shown in three columns next to each other. The first row shows the FW and BW waves in contra- and ipsilateral lines of electrodes for each dataset: our dataset and the one from Feldmann-Wüstefeld and colleagues (the first and second columns, both with visual stimulation) show a clear interaction between direction and laterality, as confirmed by the statistical analysis. The dataset from Foster and colleagues (the third column, no visual stimulation) shows a laterality effect only in the backward waves but not in the forward ones, in line with the hypothesis that FW waves are modulated only in the presence of visual stimulation. The second row shows a schematic representation of the task, and the third row illustrates the lines of electrodes used in each dataset. We hope the reviewer will be satisfied with the current data presentation.

      Also, for any naive reader, the concept of travelling waves may be hard to grasp in the way data are currently presented - only based on the results of the 2D-FFT. Can forward and backward-travelling waves be illustrated in a representative example to make this more intuitive?

      We thank the reviewer for the suggestion. We included in figure 1 an additional panel E that represents a schematic example of forward and backward waves in the temporal domain (i.e., in the EEG data). We hope this example will provide a better understanding of the data and the traveling wave concept.
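      The intuition behind separating forward from backward waves with a 2D-FFT can also be shown on a synthetic example. The sketch below assumes a time × electrode matrix with electrodes ordered from occipital to frontal, and uses a plain DFT so that it needs only the standard library; it omits any normalisation and frequency-band selection the actual analysis may apply, and all function names are hypothetical.

```python
import cmath
import math


def dft2(x):
    """Naive 2D discrete Fourier transform of a T x E list of lists."""
    T, E = len(x), len(x[0])
    return [[sum(x[t][e] * cmath.exp(-2j * math.pi * (ft * t / T + fe * e / E))
                 for t in range(T) for e in range(E))
             for fe in range(E)]
            for ft in range(T)]


def wave_power(x):
    """Return (forward, backward) power at positive temporal frequencies.

    With electrodes ordered occipital -> frontal, a wave moving toward higher
    electrode indices (forward) puts its power at negative spatial
    frequencies when the temporal frequency is positive; a backward wave
    puts it at positive spatial frequencies.
    """
    X = dft2(x)
    T, E = len(X), len(X[0])
    fw = bw = 0.0
    for ft in range(1, (T + 1) // 2):      # positive temporal frequencies
        for fe in range(1, E):
            p = abs(X[ft][fe]) ** 2
            if fe > E / 2:                 # negative spatial frequency
                fw += p
            elif fe < E / 2:               # positive spatial frequency
                bw += p
    return fw, bw


def travelling_wave(T, E, f_t, k_e, forward=True):
    """Synthetic cosine wave: f_t cycles per window, k_e cycles per array."""
    sign = 1 if forward else -1
    return [[math.cos(2 * math.pi * (f_t * t / T - sign * k_e * e / E))
             for e in range(E)] for t in range(T)]


fw, bw = wave_power(travelling_wave(16, 8, f_t=3, k_e=1, forward=True))
print(fw > bw)  # a forward-travelling wave fills the forward quadrant
```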

      Finally, the way Bayes Factors from the Bayesian ANOVA are presented, especially with those close to the ‘meaningful boundaries’ ⅓ and 3, as defined in the ‘Statistical analysis’ section, requires some unification/revision. For example, here: “We found a positive correlation between contra- and ipsi- lateral backward waves, and occipital (all Pearson’s r~=0.4, all BFs10~=3) and -to a smaller extent- frontal areas (all Pearson’s r~=0.3, all BFs10~=2).”, where the second part should strictly be labelled as inconclusive evidence. In the same vein, there is occasional mention of “negative effects”, where it should say that evidence favours the absence of an effect.

      We agree with the reviewer and apologize for the inaccuracies in reporting the statistical analysis. We corrected as suggested (see below), replacing ‘negative effects’ with ‘evidence favors the absence of an effect’.

      From the updated manuscript:

      "We found moderate evidence of a positive correlation between contra- and ipsi- lateral backward waves, and occipital (all Pearson’s r~=0.4, all BFs10~=3) but inconclusive evidence in the frontal areas (all Pearson’s r~=0.3, all BFs10~=2)."

      From the revised ‘Results’ section, now it reads:

      […] whereas all other factors and their interactions revealed evidence in favor of the absence of an effect (BFs10<0.3).

      […] but not in the forward waves (BF10=0.231, error<0.01%, supporting evidence in favor of the absence of an effect).
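      Applied consistently, the boundaries from the ‘Statistical analysis’ section amount to a three-way rule, sketched below. The function name is hypothetical, the labels follow the wording adopted in the revision, and treating the boundaries as inclusive is an assumption.

```python
def bf_evidence(bf10, lower=1 / 3, upper=3.0):
    """Classify a Bayes factor BF10 using the 1/3 and 3 boundaries."""
    if bf10 <= 0:
        raise ValueError("BF10 must be positive")
    if bf10 >= upper:
        return "evidence for an effect"
    if bf10 <= lower:
        return "evidence for the absence of an effect"
    return "inconclusive"


# Values taken from the responses above.
for bf in (667, 10.7, 2.056, 0.231):
    print(bf, bf_evidence(bf))
```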

      Reviewer #2 (Public Review):

      The present manuscript takes a new perspective and investigates the functional relevance of traveling alpha waves’ direction for visual spatial attention. While the modulation of alpha oscillatory power - and especially the lateralization of alpha power - has been associated with spatial attention in the literature, the present investigation offers a new perspective that helps understand and differentiate the functional roles of alpha oscillations in the ipsi- versus contralateral hemisphere for spatial attention.

      The present study uses a straightforward approach and provides an analysis of two EEG datasets, which consistently support the authors’ claim that two patterns of travelling alpha waves need to be differentiated in visual spatial attention: first, backward waves in the ipsilateral hemisphere, and second, forward waves in the contralateral hemisphere, which are only observed during visual stimulation. Importantly, the authors test the relation of these patterns of traveling waves to the overall power of alpha oscillations and to the hemispheric lateralization of alpha power. Furthermore, to test the functional significance, the authors demonstrate that the pattern of forward and backward waves around stimulus onset differentiates between hits and misses in task performance.

      Although the results are in line with the conclusions drawn, some questions remain. The authors investigate the relationship between traveling alpha waves and the hemispheric lateralization of alpha power, which is a well-established neural signature of spatial attention. Surprisingly, the lateralization of alpha power shown in Figure 3B appears relatively weak in the present dataset (by visual inspection), which raises the question of whether the investigation of a relation between lateralized alpha power and alpha traveling waves is warranted in the first place.

      We agree with the reviewer that the effect seems reduced compared to other studies, although the topography of alpha-band lateralization in our data is in line with the literature. In order to quantify the effect, we performed an analysis similar to that of Thut et al. (2006), defining a laterality index as:

      We computed this index for the occipital electrodes and for their average (in red in Figure R1). The results reveal that for most electrodes, including their average, the laterality index is significantly larger than 0, confirming the presence of alpha-band lateralization. However, we also note that the amplitude of the effect (~0.04) is smaller than in the study by Thut and colleagues, where it was between 0.05 and 0.10.

      Figure R1 – Laterality index for occipital electrodes, quantifying alpha-band lateralization during attention allocation. All electrodes go in the expected direction, revealing an increase of alpha-band power in the ipsilateral occipital hemisphere.
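      The exact formula of the index is not reproduced above. Purely as an illustration, a commonly used normalised form compares alpha power ipsilateral and contralateral to the attended side, (ipsi - contra) / (ipsi + contra), so that positive values indicate relatively higher ipsilateral alpha, as in Figure R1. Whether this matches the index used in the reply is an assumption, and the power values and electrode labels below are invented.

```python
def laterality_index(power_ipsi, power_contra):
    """Normalised alpha laterality, positive when ipsilateral power is higher.

    NOTE: assumed form (ipsi - contra) / (ipsi + contra); the manuscript's
    index may be normalised differently.
    """
    return (power_ipsi - power_contra) / (power_ipsi + power_contra)


# Hypothetical alpha power (arbitrary units) at two occipital electrodes.
ipsi = {"E1": 10.4, "E2": 9.8}
contra = {"E1": 9.6, "E2": 9.2}
print({ch: laterality_index(ipsi[ch], contra[ch]) for ch in ipsi})
```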

      Furthermore, the authors employ between-subject correlations (with N = 16) to test the relationship between alpha traveling waves and (lateralized) alpha power. However, as inter-individual differences in patterns of travelling waves are not the main focus here, within-subject analyses of the same relations would be able to test the authors’ hypotheses much more directly.

      As suggested, we included the recommended within-subject analysis in the revised manuscript by computing a trial-by-trial correlation between alpha power and traveling waves for each participant. First, we obtained a correlation coefficient and a p-value for each subject. Then, we tested whether the correlation coefficients had an overall positive or negative distribution (i.e., according to our previous results, we expected a positive correlation between backward waves and alpha power). Additionally, we combined the p-values to test for overall significance (using the Fisher method, see Methods section below). Our results corroborate the between-subject correlation, supporting the conclusion that alpha-band power correlates mostly with backward waves (especially contralateral to the attended location). The other correlations (i.e., forward waves and alpha power) were statistically inconclusive. We have included these new results in the revised manuscript, as shown in the following.

      From the Results section:

      “To further investigate the relation between alpha-band travelling waves and alpha power, we performed the same analysis focusing on the correlation within each participant. In particular, we correlated trial-by-trial forward and backward waves with alpha-band power for each subject, obtaining correlation coefficients ‘r’ and their respective p-values. As in the previous analysis, we correlated forward and backward waves with frontal and occipital electrodes in both contra- and ipsilateral hemispheres. We applied the Fisher method (Fisher, 1992, see Methods for details) to combine all subjects' p-values in every condition. Overall, we found a significant effect of all combined p-values (p<0.0001), except in the lateralization condition (contra- minus ipsilateral hemisphere), similar to our previous analysis. Additionally, we tested for a consistent positive or negative distribution of the correlation coefficients. As shown in figure 3C, the results support a significant correlation between backward waves and alpha power in the hemisphere contralateral to the attended location (BF10=10.7 and BF10=7.4 for occipital and frontal regions, respectively; all other BF10 were between 1 and 2, providing inconclusive evidence). Interestingly, this analysis also revealed a small but consistent effect in the correlation between lateralization effects, as we found a consistently positive correlation in the contra- minus ipsilateral difference between forward waves and alpha power (BF10~5 for both frontal and occipital electrodes). However, it is important to note that the combined p-values obtained using the Fisher method did not reach the significance threshold in the lateralization condition, reducing the relevance of this specific result.“

      From the Methods section:

      “Additionally, we computed trial-by-trial correlations between waves and alpha power for all participants. First, we tested the correlation coefficients against zero in all conditions. Then, we obtained a combined p-value per condition using the Fisher method (Fisher, 1992), as in (Zoefel et al., 2019). Specifically, we computed the statistic T, which follows a chi-square distribution with 2N degrees of freedom, from the p_i values of the N participants as: $T = -2\sum_{i=1}^{N} \ln p_i$.”
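      Fisher's method combines N independent p-values through T = -2 Σ ln p_i, which under the joint null hypothesis follows a chi-square distribution with 2N degrees of freedom. The sketch below computes the combined p-value with the standard library only, using the closed-form chi-square survival function that holds for even degrees of freedom; scipy.stats.combine_pvalues(pvalues, method='fisher') performs the same computation if SciPy is available. The function name is hypothetical.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method.

    T = -2 * sum(ln p_i) follows a chi-square distribution with 2N degrees
    of freedom under the null. For even degrees of freedom 2N, the survival
    function is exp(-T/2) * sum_{k=0}^{N-1} (T/2)**k / k!.
    Returns (T, combined_p).
    """
    if not pvalues or any(not 0 < p <= 1 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    n = len(pvalues)
    t_stat = -2.0 * sum(math.log(p) for p in pvalues)
    half = t_stat / 2.0
    term, total = 1.0, 1.0          # k = 0 term of the series
    for k in range(1, n):
        term *= half / k            # builds (T/2)**k / k! incrementally
        total += term
    return t_stat, math.exp(-half) * total


print(fisher_combined_pvalue([0.05, 0.05]))
```

      With a single p-value the method returns that p-value unchanged, which is a quick sanity check of the implementation.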

      It needs to be appreciated that the authors analyze two datasets in the present study. However, the question remains whether the absence of the forward waves effect in paradigms without visual stimulation is a general one and would replicate in other datasets. Moreover, the manuscript would benefit from a discussion of the potential implications of traveling waves for functional connectivity between posterior and anterior regions.

      We have now included a third dataset in the paper. In this dataset, from Feldmann-Wüstefeld and Vogel (2019), participants performed a visual working memory task, attending to either the left or the right side of the screen where a stimulus was displayed. We analyzed the amount of waves during stimulus presentation, and we found the same results as in our own dataset: very strong evidence in favor of an interaction between LATERALITY (contra- and ipsilateral) and DIRECTION (FW and BW). We have now included the results in figure 2 (see point above) and in the Results section of the manuscript. Unfortunately, we could not find any other publicly available EEG dataset in which participants attend to either side of the screen without ongoing visual stimulation.

      In addition, we re-analyzed our main findings (i.e., the interaction between LATERALITY and DIRECTION) in all three datasets using a classic ANOVA to report the effect size as 𝜂2 (see point above). Unlike the Bayesian ANOVA (which, in JASP, is based on linear mixed models), the classic one does not model the slope of the random effects. Yet, we observed that the LATERALITY x DIRECTION interaction in the Foster dataset was highly significant, with a large effect size (F(1,16)=9.81, p=0.003, 𝜂2=0.13). Presumably, modeling the slope of the random effects in the Bayesian ANOVA lowered its statistical sensitivity. For the sake of completeness, we reported both results in the manuscript.

      Concerning the potential implications of traveling waves for functional connectivity, we discuss the interpretation based on the Predictive Coding scheme in the penultimate paragraph of the Discussion (reported below for the reviewer’s convenience). In this framework, top-down connections have inhibitory functions, suppressing the predicted activity in lower regions. These interpretations align with our findings, relating the inhibitory role of backward travelling waves to visual attention. Similarly, in the same paragraph, we refer to the work of Spratling, which extensively investigates the relationship between selective attention and Predictive Coding.

      From the Results section:

      "To confirm our previous results, we replicated the same traveling waves analysis on two publicly available EEG datasets in which participants performed similar attentional tasks (experiment 1 of Foster et al., 2017 and experiment 1 of Feldmann-Wüstefeld and Vogel, 2019). In the first experiment from the Feldmann-Wüstefeld and Vogel dataset, participants were instructed to perform a visual working memory task in which, while keeping a central fixation, they had to memorize a set of items while ignoring a group of distracting stimuli. We focused our analysis on those trials in which the visual items to remember were placed either to the right or the left side of the screen, while the distractors were either in the upper or lower part of the screen (we pooled the trials with either 2 or 4 distractors, as this factor was irrelevant for the purposes of our analysis). The stimuli were shown for 200ms, and we computed the amount of forward and backward waves in the 500ms following stimulus onset. As shown in figure 2 (central column), the analysis confirmed our previous results, demonstrating a strong interaction between the factors DIRECTION and LATERALITY (BF10=667, error~2%; independently, the factors DIRECTION and LATERALITY had BF10=0.2 and BF10=0.4, respectively). These results confirmed that, in the presence of visual stimulation, spatial attention modulates both forward and backward waves. Next, we analyzed another publicly available dataset from Foster et al., 2017. [...]"

      "Remarkably, as shown in figure 2 (right panel), our analysis demonstrated an effect of the lateralization (LATERALITY: BF10=3.571, error~1%), revealing more waves contralateral to the attended location, but inconclusive results regarding the interaction between DIRECTION and LATERALITY (BF10=2.056, error~1%). However, using a classical ANOVA (i.e., without modeling the slope of the random terms), the interaction between DIRECTION and LATERALITY proved significant (F(1,16)=9.81, p=0.003, 𝜂2=0.13)."

      From the Methods section:

      "We included two additional datasets in this study. In both studies, participants performed a visual attention task while keeping their fixation in the center of the screen. In the Feldmann-Wüstefeld and Vogel (2019) study, participants were asked to memorize the colors of two stimuli while ignoring a set of distractor stimuli. We analyzed only those trials in which the visual stimuli were presented to the left or right side of the screen, while the distractors were placed above or below the fixation cross. After 500ms of the fixation cross, two colored 'target' stimuli were presented for 200ms. Participants were asked to memorize these stimuli, and a new 'probe' stimulus was shown after an additional second. Participants reported whether the probe matched the target stimuli or not. We analyzed the traveling waves in the 500ms following the target stimulus onset. In the second dataset, from Foster et al. (2017), participants performed a spatial attention task. First, the fixation cross cued participants to covertly attend one of eight possible spatial positions uniformly distributed around the center of the screen. After one second, a digit was displayed either in the cued location or in any other one. The remaining locations were filled with letters. Participants were instructed to report the only displayed digit. We analyzed the waves in the second before stimulus onset when participants attended to the locations cued to the left or right side of the screen (we discarded trials in which participants attended locations above or below the fixation cross). For additional details about both experimental procedures, we refer the reader to Foster et al., 2017 and Feldmann-Wüstefeld and Vogel, 2019.”

      From the discussion:

      "Our previous work proposed an alternative cause for the generation of cortical waves (Alamia and VanRullen, 2019). We demonstrated that a simple multi-level hierarchical model based on Predictive Coding (PC) principles and implementing biologically plausible constraints (temporal delays between brain areas and neural time constants) gives rise to oscillatory traveling waves propagating both forward and backward. This model is also consistent with the 2-dipoles hypothesis (Zhigalov and Jensen, 2022), considering the interaction between the parietal and occipital areas (i.e., a model of 2 hierarchical levels). However, dipoles in parietal regions are unlikely to explain the observed pattern of top-down waves, suggesting that more frontal areas may be involved in generating the feedback. This hypothesis is in line with the PC framework, in which top-down connections have an inhibitory function, suppressing the activity predicted by higher-level regions (Huang and Rao, 2011). Interestingly, Spratling proposed a simple reformulation of the terms in the PC equations that could describe it as a model of biased competition in visual attention, thus corroborating the interpretation of our finding within the PC framework (Spratling, 2008, 2012)."

    1. Author Response

      Reviewer #1 (Public Review):

      Point 1) There is ample evidence that cortical activity in the waking brain, even in head-restrained mice, is not uniform but represents a spectrum of states ranging from complete desynchronization to strong synchronization, reminiscent of the up and down states observed during sleep (Luczak et al., 2013; McGinley et al., 2015; Petersen et al., 2003). Moreover, awake synchronization can be local, affecting selected cortical areas but not others (Vyazovskiy et al., 2011). State fluctuations can be estimated using multiple criteria (e.g., pupil diameter). The authors consider reduced glutamatergic drive or long-range inhibition as potential sources of the voltage decrease but do not attempt to address this cortical state continuum, which is also likely to play a role. For example: does the voltage inactivation following ripples reflect a local down-state? The authors could start by detecting peaks and troughs in the voltage signal and investigating how ripple power is modulated around those events.

      Our study is correlational; hence, we cannot speak to any causal role that awake hippocampal ripples may play in the post-ripple hyperpolarization observed in aRSC. It is indeed possible that the post-awake-ripple neocortical hyperpolarization is independent of ripples and reflects other mechanisms to which our experiments may have been blind. One such mechanism is neocortical synchronization in the awake state. As reviewer 1 pointed out, a proportion of hippocampal ripples may occur before neocortical awake down-states. To test this hypothesis, we triggered the ripple power signal on the troughs (proxies for awake down-states) and peaks (proxies for awake up-states) of the voltage signals captured from different neocortical regions, during periods of high ripple activity, when the probability of neocortical synchronization is highest (McGinley et al., 2015; Nitzan et al., 2020). According to this analysis (see the figure below), ripple power was, on average, higher before troughs of the aRSC voltage signal than before those of other regions. On the other hand, ripple power was, on average, not higher after peaks of the aRSC voltage signal than after those of other regions. This observation supports the hypothesis that a local awake down-state could occur in aRSC after a portion of hippocampal ripples. However, a recent study whose preprint version was cited in our submission (Chambers et al., 2022, 2021) reported that, of 33 aRSC neurons whose membrane potentials were recorded, only 1 showed up-/down-state transitions (bimodal membrane potential distribution). Still, a portion (10 out of 30) of the remaining neurons showed an abrupt post-ripple hyperpolarization. In addition, they reported a modest post-ripple modulation of aRSC membrane potentials (~20% of the up-/down-state transition range). Hence, these results suggest that the post-ripple aRSC hyperpolarization is not necessarily the result of down-states in aRSC. A paragraph discussing this point was added to the discussion section (lines 262-279).

      Mean ripple power triggered on troughs and peaks of the voltage signals captured from aRSC, V1, and FLS1. Time zero marks the timestamps of the neocortical troughs/peaks. Shading represents SEM (n = 6 animals).
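The trough- and peak-triggered averaging described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: it assumes the ripple power and the neocortical voltage have already been resampled to a common time base, uses `scipy.signal.find_peaks` to locate voltage peaks and troughs (the optional `prominence` threshold is a hypothetical parameter), and does not implement the restriction to periods of high ripple activity.

```python
import numpy as np
from scipy.signal import find_peaks


def event_triggered_mean(signal, event_idx, half_window):
    """Average `signal` over [idx - half_window, idx + half_window] around each event.

    Events whose window would run past either edge of the signal are skipped.
    """
    valid = [i for i in event_idx
             if i - half_window >= 0 and i + half_window < len(signal)]
    if not valid:
        raise ValueError("no events with a complete window")
    segments = np.stack([signal[i - half_window:i + half_window + 1] for i in valid])
    return segments.mean(axis=0)


def trough_peak_triggered_ripple_power(voltage, ripple_power, half_window, prominence=None):
    """Ripple power averaged around voltage troughs (down-state proxies) and peaks (up-state proxies)."""
    troughs, _ = find_peaks(-voltage, prominence=prominence)  # troughs = peaks of the inverted signal
    peaks, _ = find_peaks(voltage, prominence=prominence)
    return (event_triggered_mean(ripple_power, troughs, half_window),
            event_triggered_mean(ripple_power, peaks, half_window))
```

Running this per region and comparing the pre-event part of the trough-triggered averages across regions corresponds to the comparison made in the figure.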

      Point 2) Ripples are known to be heterogeneous in multiple parameters (e.g., power, duration, isolated events/ripple bursts, etc.), and this heterogeneity was shown to have functional significance on multiple occasions (e.g. Fernandez-Ruiz et al., 2019 for long-duration ripples; Nitzan et al., 2022 for ripple magnitude; Ramirez-Villegas et al., 2015 for different ripple sharp-wave alignments). Could the small effect size shown here (e.g. 0.3 SD in Fig. 2a) be because ripples with different properties and downstream effects are averaged together? The authors should investigate whether ripples with different properties differ in their effects on the cortical signals.

      The seemingly small effect size (e.g. 0.3 SD in Fig. 2a) arises because the individual peri-ripple voltage/glutamate traces were z-scored against a peri-non-ripple distribution and then averaged. Alternatively, the peri-ripple traces could have been averaged first, and the averaged trace could then have been z-scored against a sampling distribution constructed from the same peri-non-ripple distribution, with the sample size equal to the number of ripples detected for a given animal. In the latter case, the divisor in the z-scoring would be the standard deviation of the sampling distribution rather than that of the original peri-non-ripple distribution. Since the standard deviation of the sampling distribution is smaller than that of the original distribution by a factor of √(sample size), the final z-scored values in the latter case would be larger than those in the former by the same factor. For instance, if the sample size in Fig. 2a (number of ripples) were 100, the mean z-scored value would be 0.3 × 10 = 3. Nevertheless, the relationship between ripple features and neocortical activity is worth investigating, as described below.
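The √(sample size) relationship between the two z-scoring conventions can be checked numerically. In this minimal sketch, `mu` and `sigma` stand for the per-time-point mean and standard deviation of the peri-non-ripple distribution and are assumed to be given. Because averaging is linear, averaging the z-scored traces equals z-scoring the averaged trace with `sigma`, whereas the sampling-distribution convention divides by `sigma / sqrt(n)`.

```python
import numpy as np


def zscore_then_average(traces, mu, sigma):
    """Convention used in the figures: z-score each peri-ripple trace, then average.

    traces: (n_ripples, n_times); mu, sigma: peri-non-ripple mean and SD per time point.
    """
    return ((traces - mu) / sigma).mean(axis=0)


def average_then_zscore(traces, mu, sigma):
    """Alternative: average first, then z-score against the sampling distribution
    of the mean of n traces, whose standard deviation is sigma / sqrt(n)."""
    n = traces.shape[0]
    return (traces.mean(axis=0) - mu) / (sigma / np.sqrt(n))
```

With n = 100 ripples, a value of 0.3 under the first convention becomes 3 under the second, matching the example above.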

      To investigate the relationship between hippocampal ripple power and peri-ripple neocortical voltage activity, we focused on the agranular retrosplenial cortex (aRSC), as it showed the highest level of modulation around ripples. To explore which features of aRSC voltage activity might be correlated with ripple power, the ripples were divided into 8 subgroups using the octiles (8-quantiles) of their power distribution, and the corresponding aRSC voltage traces were averaged within each subgroup (similar to Nitzan et al., 2022). The results of this analysis are summarized in the figure below.

      Left: peri-ripple aRSC voltage traces triggered on ripples in the odd-numbered ripple power subgroups for each animal and then averaged across 6 animals. Standard errors of the mean are omitted for clarity. Right: same as the left panel, but for the lowest and highest power subgroups only. Shading represents the standard error of the mean.
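The octile-based grouping can be sketched as below. `octile_averages` is a hypothetical helper that assumes the peri-ripple traces of one animal are stored as a ripples × time array aligned on ripple centers. It accepts any per-ripple feature, so the same function would apply to the duration-based grouping described later. Ties at quantile edges can leave a group empty, in which case its average is NaN.

```python
import numpy as np


def octile_averages(ripple_feature, peri_traces, n_groups=8):
    """Average peri-ripple traces within quantile groups of a ripple feature.

    ripple_feature: (n_ripples,), e.g. ripple power or duration
    peri_traces:    (n_ripples, n_times), aligned on ripple centers
    Returns (n_groups, n_times); row 0 is the lowest-feature group.
    """
    edges = np.quantile(ripple_feature, np.linspace(0, 1, n_groups + 1))
    # Only the inner edges are used, so the minimum lands in group 0 and the
    # maximum in the last group; values equal to an inner edge go to the upper group.
    groups = np.searchsorted(edges[1:-1], ripple_feature, side="right")
    return np.stack([peri_traces[groups == g].mean(axis=0) for g in range(n_groups)])
```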

      These results suggested a possible positive correlation between ripple power and pre- and post-ripple aRSC voltage amplitude. To test this possibility, Pearson's correlation between ripple power and pre-/post-ripple aRSC amplitude was calculated for each animal separately. The power of each detected ripple was defined as the average of the ripple-band-filtered, squared, and smoothed hippocampal LFP trace from -50 ms to +50 ms relative to the ripple's largest trough (ripple center). The pre- and post-ripple aRSC amplitudes for each ripple were calculated as the average of the aRSC voltage trace over the intervals [-200 ms, 0] and [0, 200 ms], respectively. The results are shown below.

      Top: scatter plots of ripple power versus pre-ripple aRSC voltage amplitude for individual animals. The black line in each graph is the linear regression line, and each blue circle represents one ripple. Pearson's correlation coefficients (ρ) and their corresponding p-values are shown above each graph. Bottom: same as the top graphs, but for post-ripple aRSC amplitude.
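The per-animal computation described above can be sketched as follows. The definitions of ripple power and of the pre-/post-ripple windows follow the text, but the 150-250 Hz band, the 4th-order Butterworth filter, and the 10 ms boxcar smoothing are hypothetical placeholders for the parameters given in the methods section. Ripple centers (largest troughs) are assumed to have been detected already, and all windows are assumed to lie inside the recording.

```python
import numpy as np
from scipy.ndimage import uniform_filter1d
from scipy.signal import butter, filtfilt
from scipy.stats import pearsonr


def ripple_power_trace(lfp, fs, band=(150.0, 250.0), smooth_ms=10.0):
    """Ripple-band filter, square, and smooth the hippocampal LFP (placeholder parameters)."""
    b, a = butter(4, band, btype="bandpass", fs=fs)
    filtered = filtfilt(b, a, lfp)  # zero-phase filtering
    width = max(1, int(round(smooth_ms * fs / 1000.0)))
    return uniform_filter1d(filtered ** 2, size=width)


def window_means(x, fs, centers_s, start_ms, stop_ms):
    """Mean of x over [t + start_ms, t + stop_ms] (inclusive) around each center time t, in seconds."""
    means = []
    for t in centers_s:
        lo = int(round((t + start_ms / 1000.0) * fs))
        hi = int(round((t + stop_ms / 1000.0) * fs)) + 1
        means.append(x[lo:hi].mean())
    return np.array(means)


def power_vs_amplitude(lfp, fs_lfp, voltage, fs_img, centers_s):
    """Pearson (r, p) of ripple power versus pre- and post-ripple aRSC amplitude for one animal."""
    power = window_means(ripple_power_trace(lfp, fs_lfp), fs_lfp, centers_s, -50, 50)
    pre = window_means(voltage, fs_img, centers_s, -200, 0)
    post = window_means(voltage, fs_img, centers_s, 0, 200)
    r_pre, p_pre = pearsonr(power, pre)
    r_post, p_post = pearsonr(power, post)
    return (r_pre, p_pre), (r_post, p_post)
```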

      According to this analysis, 4 out of 6 animals showed a weak positive correlation (ρ = 0.0806 ± 0.0115; mean ± std), 1 animal showed a negative correlation (ρ = -0.20183), and 1 animal did not show a statistically significant correlation (p-value > 0.05) between ripple power and pre-ripple aRSC voltage amplitude. Moreover, 2 out of 6 animals showed a negative correlation (ρ = -0.1 and -0.14), and 4 animals did not show a statistically significant correlation (p-value > 0.05) between ripple power and post-ripple aRSC voltage amplitude.

      To check that the correlation results were not driven by extreme values of ripple power and aRSC voltage, we repeated the same correlation analysis after removing the ripples associated with the top and bottom 5% of ripple power and aRSC voltage values. According to this analysis, 1 out of 6 animals showed a negative correlation (ρ = -0.13), and 5 animals did not show a statistically significant correlation (p-value > 0.05) between ripple power and pre-ripple aRSC voltage amplitude. Moreover, 2 out of 6 animals showed a negative correlation (the same animals that showed a negative correlation before removing the extreme values; ρ = -0.12 and -0.14), 1 animal showed a positive correlation (ρ = 0.1), and 3 animals did not show a statistically significant correlation (p-value > 0.05) between ripple power and post-ripple aRSC voltage amplitude.
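The trimming step can be sketched as below, assuming (as one reading of the text) that a ripple is dropped if either its power or its aRSC amplitude falls in the top or bottom 5% of the respective distribution. `trimmed_pearson` is a hypothetical helper, not code from the study.

```python
import numpy as np
from scipy.stats import pearsonr


def trimmed_pearson(x, y, pct=5.0):
    """Pearson correlation after dropping ripples in the top or bottom pct% of x or of y.

    Returns (r, p, number of ripples kept).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_lo, x_hi = np.percentile(x, [pct, 100 - pct])
    y_lo, y_hi = np.percentile(y, [pct, 100 - pct])
    keep = (x >= x_lo) & (x <= x_hi) & (y >= y_lo) & (y <= y_hi)
    r, p = pearsonr(x[keep], y[keep])
    return r, p, int(keep.sum())
```

A single extreme ripple can dominate the untrimmed coefficient, which is what this check guards against.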

      Based on these results, we cannot conclude that there is a meaningful correlation between ripple power and the amplitude of aRSC voltage activity before and after ripples. Notably, Nitzan et al. (see Fig. S6 in Nitzan et al., 2022) did not report a statistically significant correlation between ripple power octile number (obtained by discretizing a continuous-valued random variable into 8 subgroups) and the pre-ripple firing rate of the mouse visual cortex. However, they reported a statistically significant negative correlation (ρ = -0.13) between ripple power octile number and the post-ripple firing rate of the mouse visual cortex. This negative correlation appears to have been influenced by the disproportionately large firing rates associated with the first ripple power octile compared with the other octiles. Therefore, repeating their analysis after removing the first octile would probably yield a weak correlation close to 0.
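The argument about a single outlying octile can be illustrated with toy numbers. These values are invented for illustration and are not taken from Nitzan et al., 2022 or from our data: octiles 2-8 share identical firing rates, and only octile 1 has higher rates.

```python
import numpy as np


def octile_rate_correlation(octile, rate):
    """Pearson correlation between ripple power octile number and firing rate."""
    return np.corrcoef(octile, rate)[0, 1]


# Toy values, not data: two ripples per octile, identical rates in octiles 2-8.
octile = np.repeat(np.arange(1, 9), 2)
rate = np.tile([1.0, 3.0], 8)
rate[:2] = [5.0, 7.0]  # disproportionately high rates in the first octile only

r_all = octile_rate_correlation(octile, rate)  # about -0.46
keep = octile > 1
r_without_first = octile_rate_correlation(octile[keep], rate[keep])  # 0.0
```

In this toy case the entire negative correlation comes from the first octile; removing it leaves no linear relationship.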

      Next, we investigated the relationship between ripple duration and aRSC voltage activity. To explore which features of aRSC voltage activity might be correlated with ripple duration, the ripples were divided into 8 subgroups using the octiles of their duration distribution, and the corresponding aRSC voltage traces were averaged within each subgroup. The results of this analysis are summarized in the figure below.

      Left: peri-ripple aRSC voltage traces triggered on ripples in the odd-numbered ripple duration subgroups for each animal and then averaged across 6 animals. Standard errors of the mean are omitted for clarity. Right: same as the left panel, but for the lowest and highest duration subgroups only. Shading represents the standard error of the mean.

      These results do not reveal qualitative differences in the pattern of aRSC peri-ripple voltage modulation across ripple duration subgroups. Nevertheless, the same correlation analysis performed for ripple power was also conducted for ripple duration. Only 1 animal out of 6 showed a statistically significant correlation (ρ = 0.08) between pre-ripple aRSC voltage amplitude and ripple duration.

      Moreover, only 1 animal out of 6 showed a statistically significant correlation (ρ = -0.08) between post-ripple aRSC voltage amplitude and ripple duration. In conclusion, there does not seem to be a meaningful linear relationship between peri-ripple aRSC voltage amplitude and ripple duration.

      Next, we investigated whether the peri-ripple aRSC voltage modulation differs depending on whether a single or a bundled ripple occurs in the dorsal hippocampus. Bundled ripples were detected following the method described in our previous work (Karimi Abadchi et al., 2020). We found that 9.4 ± 3.5% (mean ± std across 6 animals) of the ripples occurred in bundles. The aRSC voltage trace was then triggered on the centers of single ripples and on the centers of the first and second ripples within bundles, averaged for each animal, and then averaged across 6 animals. The results of this analysis are shown in the following figure.

      Left: animal-wise average of the mean peri-ripple aRSC voltage trace triggered on the centers of single ripples and on the centers of the first ripple in bundles. Right: same as the left panel, but triggered on the centers of the second ripple in bundles.
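The split into single and bundled ripples can be sketched as follows. The actual bundling criterion is the one described in Karimi Abadchi et al., 2020 and is not restated here; this sketch assumes instead that consecutive ripples whose centers lie within a hypothetical `max_gap_s` of each other are chained into one bundle. The returned center times can then be used to trigger the aRSC voltage trace.

```python
import numpy as np


def split_single_and_bundled(centers_s, max_gap_s):
    """Split ripple centers (in seconds) into single ripples and bundles.

    Consecutive ripples whose centers are at most `max_gap_s` apart (a hypothetical
    criterion) are chained into one bundle. Returns the centers of single ripples,
    the centers of the first and second ripples of each bundle, and the percentage
    of ripples that belong to bundles.
    """
    centers = np.sort(np.asarray(centers_s, dtype=float))
    if centers.size == 0:
        empty = np.array([])
        return empty, empty, empty, 0.0
    bundles, current = [], [centers[0]]
    for c in centers[1:]:
        if c - current[-1] <= max_gap_s:
            current.append(c)  # close enough to the previous ripple: same bundle
        else:
            bundles.append(current)
            current = [c]
    bundles.append(current)
    multi = [b for b in bundles if len(b) > 1]
    singles = np.array([b[0] for b in bundles if len(b) == 1])
    firsts = np.array([b[0] for b in multi])
    seconds = np.array([b[1] for b in multi])
    pct_bundled = 100.0 * sum(len(b) for b in multi) / centers.size
    return singles, firsts, seconds, pct_bundled
```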

      These results suggest that aRSC voltage activity has a larger amplitude before bundled than before single ripples, and that its timing is shifted to later times for bundled versus single ripples. The larger pre-ripple depolarization might signal the occurrence of a bundled ripple (similar to the larger pre-bundled- than pre-single-ripple deactivation observed during sleep (Karimi Abadchi et al., 2020)).

      Point 3) The differences between the voltage and glutamate signals are puzzling, especially in light of the fact that in the sleep state they went hand in hand (Karimi Abadchi et al., 2020, Fig. 2). It is also somewhat puzzling that the aRSC is the first area to show voltage inactivation but the last area to display an increase in glutamate signal, despite its anatomical proximity to hippocampal output (two synapses away). The SVD analysis hints that the glutamate signal is potentially multiplexed (although this analysis also requires more attention, see below), but does not provide a physiologically meaningful explanation. The authors speculate that feed-forward inhibition via the gRSC could be involved, but I note that the aRSC is among the two major targets of the gRSC pyramidal cells (the other being homotypical projections) (Van Groen and Wyss, 2003), i.e., glutamatergic signals are also at play. To meaningfully interpret the results in this paper, it would be instrumental to solve this discrepancy, e.g., by adding experiments monitoring the activity of inhibitory cells.

      We were also surprised to observe that glutamate and voltage signals do not go hand in hand in the awake state as they do during sleep, and this was the main reason we performed the SVD analysis. This was especially puzzling because a portion of aRSC excitatory neurons showed elevated calcium activity despite the population-level reduction of voltage and delayed elevation of glutamate signals in aRSC. At the time of initial submission, pre-ripple reduction and post-ripple elevation of calcium activity in a portion of three subclasses of superficial aRSC inhibitory neurons had been reported (Chambers et al., 2022, 2021), which was the basis of our speculation on the potential involvement of feed-forward inhibition in the post-ripple voltage reduction. We speculated that this potential feed-forward inhibition could stem from gRSC excitatory neurons, as reviewer 1 pointed out, or from other neocortical or subcortical regions projecting to aRSC. Feedback inhibition may also be involved, whereby principal aRSC neurons excited by gRSC (as reviewer 1 pointed out) or by any other region, including aRSC itself, excite aRSC inhibitory neurons.

      Point 4) I am puzzled by the ensemble-wise correlation analysis of the voltage imaging data: the authors point to a period of enhanced positive correlation between cortex and hippocampus 0-100 ms after the ripple center but here the correlation is across ripple events, not in time. This analysis hints that there is a positive relationship between CA1 MUA (an indicator for ripple power) and the respective cortical voltage (again an incentive to separate ripples by power), i.e. the stronger the ripple the less negative the cortical voltage is, but this conclusion is contradictory to the statements made by the authors about inhibition.

      A closer look at Figure 2B iv reveals that the elevation of the cross-correlation function between peri-ripple aRSC voltage and hippocampal MUA starts with a short delay (~20 ms) and peaks around 75 ms after the ripple centers. This means the maximum correlation between the two signals occurs at the point (75 ms, 75 ms) on the MUA-time/voltage-time plane, whose origin (0, 0) corresponds to the ripple centers in the hippocampal MUA and the corresponding imaging frame in the voltage signal. Reviewer 1's interpretation would be correct if the maximum correlation occurred at the point (0, 0) rather than at (75 ms, 75 ms), because the MUA value at the ripple center (t = 0), not at t = 75 ms, indicates ripple power. Figure 2B iii shows that the amplitude of hippocampal MUA is more than 2 dB lower at t = 75 ms than at t = 0, reflecting the fact that ripples are often short-duration events. Instead, had the maximum correlation occurred at the point (0, 100 ms), where the ripples had maximum power and aRSC voltage was at its trough (Figure 2B iii), it could have been concluded that “the stronger the ripple the less negative the cortical voltage”.

      Point 5) Following my previous point, it is difficult to interpret the ensemble-wise correlation analysis in the absence of rigorous significance testing. The increased correlation between the HPC and RSC following ripples is equal in magnitude to the correlation between pre-ripple HPC MUA and post-ripple cortical activity. How should those results be interpreted? The authors could, for example, use cluster-based analysis (Pernet et al., 2015) with temporal shuffling to obtain significant regions in those plots. In addition, the authors should mark the diagonal of those plots, or even better compute the asymmetry in correlation (see Steinmetz et al., 2019 Extended Fig. 8 as an example), to make it easier for the reader to discern lead/lag relationships.

      The purpose of calculating the ensemble-wise correlation coefficient was to provide further information about the relationship between two random processes: peri-ripple HPC MUA and peri-ripple neocortical activity. In general, the correlation between two random processes cannot be inferred from the temporal relationship between their mean functions; in other words, there are infinitely many possible correlation functions for two random processes with given mean functions. Moreover, the aim was to compare the correlation between peri-ripple neocortical activity and HPC MUA across neocortical regions. The fact that mean peri-ripple activity differs between, for example, RSC and FLS1 does not necessarily mean that their correlation functions with peri-ripple HPC MUA also differ.
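The ensemble-wise correlation described above can be sketched as follows, assuming both peri-ripple signals have been brought to a common time base and stored as ripples × time arrays. Entry `C[i, j]` is the Pearson correlation, computed across ripples, between MUA at time i and cortical activity at time j, so the map says nothing about temporal correlation within a single trace.

```python
import numpy as np


def ensemble_correlation(mua, cortex):
    """Ensemble-wise correlation map between two peri-ripple processes.

    mua, cortex: arrays of shape (n_ripples, n_times) on a common time base.
    Returns C with C[i, j] = Pearson correlation, across ripples,
    between mua[:, i] and cortex[:, j].
    """
    za = (mua - mua.mean(axis=0)) / mua.std(axis=0)
    zb = (cortex - cortex.mean(axis=0)) / cortex.std(axis=0)
    return za.T @ zb / mua.shape[0]
```

As a check on the point about mean functions, two synthetic cortical signals with the same (near-zero) mean can produce very different correlation maps with the same MUA, depending on whether they share ripple-to-ripple fluctuations with it.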

      As requested, we performed cluster-based significance testing via temporal shuffling for each VSFP (n = 6), iGluSnFR Ras (n = 4), and iGluSnFR EMX (n = 4) animal individually. The following figures summarize the number of animals showing significant regions in the correlation functions between peri-ripple HPC MUA and different neocortical regions. The diagonal of each correlation function is marked; however, temporal lead/lag should not be inferred from these results, mainly because the temporal resolutions of the two signals, one electrophysiological and one optical, are not the same.
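A simplified cluster-based permutation test on such a correlation map can be sketched as follows. This is not the exact procedure used for the figures: the null distribution here is built by permuting the pairing between MUA and cortical trials, which is a swapped-in variant of the temporal shuffling described in the methods section, and the cluster-forming threshold `r_thresh`, `n_perm`, and `alpha` are hypothetical choices. Clusters are contiguous regions where |r| exceeds the threshold, and their mass (summed |r|) is compared with the distribution of the maximum cluster mass under the null.

```python
import numpy as np
from scipy.ndimage import label


def _ensemble_corr(a, b):
    za = (a - a.mean(axis=0)) / a.std(axis=0)
    zb = (b - b.mean(axis=0)) / b.std(axis=0)
    return za.T @ zb / a.shape[0]


def cluster_permutation_test(mua, cortex, r_thresh=0.1, n_perm=200, alpha=0.05, seed=0):
    """Return the correlation map and a boolean mask of significant clusters."""
    rng = np.random.default_rng(seed)

    def cluster_masses(C):
        labels, n = label(np.abs(C) > r_thresh)  # contiguous supra-threshold regions
        masses = np.array([np.abs(C)[labels == k].sum() for k in range(1, n + 1)])
        return labels, masses

    C = _ensemble_corr(mua, cortex)
    labels, masses = cluster_masses(C)
    null_max = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(mua.shape[0])  # break the ripple-by-ripple pairing
        _, m = cluster_masses(_ensemble_corr(mua, cortex[perm]))
        null_max[i] = m.max() if m.size else 0.0
    crit = np.quantile(null_max, 1 - alpha)
    significant = np.zeros(labels.shape, dtype=bool)
    for k, mass in enumerate(masses, start=1):
        if mass > crit:
            significant |= labels == k
    return C, significant
```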

      Point 6) For the single cell 2-photon responses presented in Fig. 3, how should the reader interpret a modulation that is at most 1/20 of a standard deviation? Was there any attempt to test for the significance of modulation (e.g., by comparing to shuffle)? If yes, what is the proportion of non-modulated units? In addition, it is not clear from the averages whether those cells represent bona fide distinct groups or whether, for instance, some cells can be upmodulated by some ripples but downmodulated by others. Again, separation of ripples based on objective criteria would be useful to answer this question.

      As explained in response to point 2, the seemingly small modulation size (e.g. 0.05 SD in Fig. 3b) arises because the individual peri-ripple calcium traces were z-scored against a peri-non-ripple distribution and then averaged. Alternatively, the peri-ripple traces could have been averaged first, and the averaged trace could then have been z-scored against a sampling distribution constructed from the same peri-non-ripple distribution, with the sample size equal to the number of ripples detected for a given animal. In that case, the divisor in the z-scoring would be the standard deviation of the sampling distribution rather than that of the original peri-non-ripple distribution. Since the standard deviation of the sampling distribution is smaller than that of the original distribution by a factor of √(sample size), the final z-scored values would be larger by the same factor.

      As suggested by the reviewer, and to make our results more comparable with those of electrophysiological studies, we deconvolved the calcium traces and tested the significance of each neuron's modulation by comparing its mean peri-ripple deconvolved trace with a neuron-specific shuffled distribution (see the methods section for details). We found that 8.46 ± 3% (mean ± std across 11 mice) of neurons were significantly modulated over the interval [0, 200 ms], of which 81.08 ± 8.91% (mean ± std across 11 mice) were up-modulated. If being significantly up- or down-modulated is taken as the criterion for being distinct, these two groups could be considered distinct. The following figures show mean peri-ripple calcium and deconvolved traces, averaged across up- or down-modulated neurons for each mouse and then across 11 mice.
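A per-neuron shuffle test of this kind can be sketched as below. The deconvolution itself is not shown: the sketch takes an already-deconvolved trace as input. The null is built from randomly placed pseudo-events, which is one common way to construct a neuron-specific shuffled distribution and may differ from the procedure in the methods section. `n_post` (e.g. the number of frames in [0, 200 ms]), `n_shuffle`, and the two-sided `alpha` are hypothetical parameters.

```python
import numpy as np


def peri_event_mean(trace, events, n_post):
    """Mean of trace over [e, e + n_post) samples, averaged across events."""
    return np.mean([trace[e:e + n_post].mean() for e in events])


def ripple_modulation(trace, ripple_idx, n_post, n_shuffle=1000, alpha=0.05, seed=0):
    """Classify one neuron's deconvolved trace as 'up', 'down', or 'none'.

    Assumes every ripple index satisfies idx + n_post <= len(trace).
    """
    rng = np.random.default_rng(seed)
    ripple_idx = np.asarray(ripple_idx)
    observed = peri_event_mean(trace, ripple_idx, n_post)
    max_start = len(trace) - n_post
    # Null: same number of pseudo-events placed uniformly at random in the recording.
    null = np.array([
        peri_event_mean(trace, rng.integers(0, max_start + 1, size=ripple_idx.size), n_post)
        for _ in range(n_shuffle)
    ])
    lo, hi = np.quantile(null, [alpha / 2, 1 - alpha / 2])
    if observed > hi:
        return "up"
    if observed < lo:
        return "down"
    return "none"
```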

      Point 7) Fig. 3: The decomposition-based analysis of glutamate imaging using SVD needs to be improved. First, it is not clear how much of the variance is captured by each component, and it seems like no attempt has been made to determine the number of significant components or to use a cross-validated approach. Second, the authors imply that reconstructing the glutamate imaging data using the 2nd-100th components 'matches' the voltage signal but this statement holds true only in the case of the aRSC and not for other regions, without providing an explanation, raising questions as to whether this similarity is genuine or merely incidental.

      The first 100 components explained about 99.9% of the variance in the concatenated stack of peri-ripple neocortical glutamate activity for each animal, which is practically the entire variance in the data. Our goal was not to obtain a low-rank approximation of the data, for which the number of significant components would have had to be determined. Instead, we decomposed the data into the activity along the first principal component, which showed no noticeable topography across neocortical regions, and the activity along the remaining components, which did show a noticeable topography across neocortical regions. The first component explained 83.11 ± 6.75% (mean ± std across 4 iGluSnFR Ras mice) and 83.3 ± 5.07% (mean ± std across 4 iGluSnFR EMX mice) of the variance in the concatenated stack of peri-ripple neocortical glutamate activity.
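The decomposition described above can be sketched with NumPy's SVD. The sketch assumes the concatenated peri-ripple stack is arranged as pixels × (time samples of all ripples) and that each pixel is mean-centered over time before the SVD; whether the original analysis centered the data is an assumption here. It returns the reconstruction from the first component, the reconstruction from components 2-100, and the fraction of variance carried by each component.

```python
import numpy as np


def svd_split(stack, n_components=100):
    """Split a (pixels x time) activity matrix into its first SVD component
    and the sum of components 2..n_components.

    Returns (first, rest, explained), where explained[k] is the fraction of
    the total variance carried by component k.
    """
    X = stack - stack.mean(axis=1, keepdims=True)  # center each pixel over time (assumption)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    k = min(n_components, s.size)
    first = s[0] * np.outer(U[:, 0], Vt[0])
    rest = (U[:, 1:k] * s[1:k]) @ Vt[1:k]  # scale each singular vector, then recombine
    return first, rest, explained
```

Summing `explained[:100]` gives the cumulative variance figure quoted above, and `explained[0]` the share of the first component.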

      As we noted in the discussion section of the manuscript, SVD is agnostic to brain mechanisms and only seeks to capture maximum variance; in particular, it is not designed to capture the maximum similarity between glutamate and voltage activity. Therefore, the only conclusion we can state with certainty is the following: when the activity along the axis of maximum co-variability (1st principal component) across the neocortical regions' glutamate activity is removed, only aRSC, and no other region, shows a post-ripple down-modulation, whose timing matches that of the aRSC post-ripple voltage down-modulation. Moreover, the timing of the activity of the 1st principal component better matches that of the calcium activity of the up-modulated portion of aRSC neurons. Although the genuineness of these results is not guaranteed, the similarity between the timing of the SVD output for aRSC glutamatergic activity and that of two independently collected signals in aRSC, i.e. voltage and calcium, could support the idea that peri-ripple aRSC glutamatergic activity is a mixture of up- and down-modulated components.

      Point 8) The estimation of deep pyramidal cells' glutamate activity by subtracting the Ras group (Fig. 4d) is not very convincing. First, the efficiency of transgene expression can vary substantially across different mouse lines. Second, it is not clear to what extent the wide field signal reflects deep cells' somatic vs. dendritic activity due to non-linear scattering (Ma et al., 2016), and it is questionable whether a simple linear subtraction is appropriate. The quality of the manuscript would improve substantially if the authors probe this question directly, either by using deep layer specific line/ 2-P imaging of deep cells or employing available public datasets.

      Simulation studies have suggested that the signal captured by wide-field imaging of voltage-sensitive dyes can be modeled as a weighted sum of voltage activity across neocortical layers (Chemla and Chavane, 2010; Newton et al., 2021). Hence, modeling the glutamate signal as a weighted sum of glutamate activity across neocortical layers is a reasonable starting point. Future studies will be needed to refine this starting point by imaging glutamate activity in a cohort of mice with iGluSnFR expression restricted to deep-layer neurons. Moreover, Ma et al. (2016) stated that “This means that signal detected at the cortical surface (in the form of a two-dimensional image) represents a superficially weighted sum of signals from shallow and deeper layers of the cortex”.
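Under the weighted-sum assumption above, the subtraction used to estimate deep-layer activity can be written out explicitly. The symbols are ours, not the manuscript's: g_s(t) and g_d(t) denote superficial and deep glutamatergic activity, and the w and v terms are unknown layer weights. As the subtraction implies, the sketch assumes the Ras signal reflects superficial neurons only, while the EMX signal reflects both.

```latex
\begin{aligned}
S_{\mathrm{EMX}}(t) &= w_s\, g_s(t) + w_d\, g_d(t), \\
S_{\mathrm{Ras}}(t) &= v_s\, g_s(t), \\
\Rightarrow\quad w_d\, g_d(t) &= S_{\mathrm{EMX}}(t) - \alpha\, S_{\mathrm{Ras}}(t),
\qquad \alpha = \frac{w_s}{v_s}.
\end{aligned}
```

The estimate is therefore only as good as the assumptions that the layers combine linearly and that a single scale factor α accounts for differences in indicator expression between the two lines, which are the points raised by the reviewer.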

      Reviewer #2 (Public Review):

      Point 1) The authors throughout the manuscript compare the correlation between hippocampal MUA and the imaged cortical ensemble activity (Example: Lines 120-122). There is a potential time lag in signal detection with regard to the two detection methods. While the time lag using electrophysiological recording is at the scale of milliseconds, the glutamate-sensitive imaging might take several 100s of ms to be detected. It is not clear in the manuscript how the authors considered this problem during the analysis.

      The ensemble-wise correlation analysis characterizes the relationship between two random processes, peri-ripple HPC MUA and peri-ripple neocortical activity (please see the response to reviewer 1's major point 5). Although the two signals differ in temporal resolution, which could introduce an error in the exact timing of the relationship between the two processes, we did not draw any conclusions based on the exact timing of the elevated correlation between them. Moreover, we smoothed (equivalent to low-pass filtering) and down-sampled the MUA signal (please see the methods section) to bring the temporal scales of the two processes closer to each other. We also want to clarify that the temporal resolution of voltage and glutamate imaging is in the range of tens of ms (Xie et al., 2016).
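The smoothing and down-sampling step can be sketched as follows. The Gaussian kernel, its 10 ms width, and the bin-averaging scheme are hypothetical placeholders for the parameters given in the methods section; the sketch also assumes the MUA sampling rate is an integer multiple of the imaging frame rate.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d


def match_mua_to_frames(mua, fs_mua, fs_img, sigma_ms=10.0):
    """Smooth an MUA trace with a Gaussian kernel (low-pass), then average it
    within non-overlapping bins of one imaging frame.

    Trailing samples that do not fill a whole frame are dropped.
    """
    smoothed = gaussian_filter1d(np.asarray(mua, dtype=float),
                                 sigma=sigma_ms * fs_mua / 1000.0)
    step = int(round(fs_mua / fs_img))
    n_frames = smoothed.size // step
    return smoothed[:n_frames * step].reshape(n_frames, step).mean(axis=1)
```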

      Point 2) In the results section "The peri-ripple glutamatergic activity is layer dependent", are the Ras and EMX expressed in two different experimental animal groups? If yes, and there was a time lag between the two groups, is it valid to estimate the deeper layer activity using a scaled version of the Ras from the EMX signal?

      This comment is addressed in response to reviewer 1’s major point 8.

      Point 3) The authors did not discuss the results adequately in the discussion section. Since there is no behavioral paradigm and no behavioral read-out to induce or correlate it with possible planning and future decision-making process, the significance of the paper will be enhanced by discussing the possible underlying circuitry mechanism that might cause the reported observations. With no planning periods in the task (instead just sitting on a platform), it is actually quite unclear what the purpose of wake ripples should be. For example, the authors discuss the superficial and deep layer responses and their relation to the memory index theory. However, the RSC possesses different groups of excitable neurons in different layers. Specifically, three excitable neurons are found within the different layers of the RSC; the intrinsically bursting neurons (IB), regular spiking (RS), and low-rheobase (LR) neurons. These neurons are distributed heterogeneously within the RSC cortical layer. Although the RS are abundant in the deeper layers of the RSC, they occupy 40% of the total amount of excitable neurons found in layers II/III. On the other hand, the LR is the dominant excitable neuron in the superficial layers. It will add to the significance of the work if the authors discussed the results in the context of the cellular structure of the RSC and how would that impact the observed inhibition in the peri-ripple time window. It would be helpful for the readers and the reviewers to add a schematic diagram to the discussion section.

      The goal of our study was to characterize the patterns of neocortical activity around hippocampal ripples in the awake state rather than to shed light on the function (purpose) of awake ripples. However, we speculated about what our results could mean in the discussion section. To address the reviewer's comment on the differences across RSC layers, the following paragraph was added to the discussion section (lines 342-353).

      “Our results suggest that dendrites of deep pyramidal neurons, arborized in the superficial layers of the neocortex, receive glutamatergic modulation earlier than those of superficial neurons. However, the results do not provide a mechanistic explanation of this phenomenon. The observed layer dependency of the glutamatergic modulation may partially result from the heterogeneity of excitatory as well as inhibitory neurons across aRSC layers, but our data do not reveal how this heterogeneity would lead to such layer dependency. It could be speculated that differences in the dendritic morphology and firing type of different types of RSC excitatory neurons (Yousuf et al., 2020) or differences in the connectivity of different RSC layers with other brain regions play a role (Sugar et al., 2011; van Groen and Wyss, 1992; Whitesell et al., 2021). This is a complex problem that could only be resolved by experiments specifically designed to address it.”

      Point 4. A general issue (in addition to the missing behaviour), is the mix of the methods. On one side this makes the article very interesting since it highlights that with different methods you actually observe different things. But on the other side, it makes it very difficult to follow the results. It would be a major improvement of the article if the authors could include (as mentioned above) a schematic of the results and their theory, especially highlighting how the different methods would capture different parts of the mechanism. Finally, the authors should not use calcium signals as a direct measure of neuronal firing. Calcium influx is only seen in bursts of firing, not with individual spikes. It is a plasticity signal and therefore should be treated and discussed as such. Just recently it was shown by Adamantidis lab that the calcium signal changes between wake and sleep and this change does not parallel changes in neuronal firing/spikes.

      We agree with the reviewer that the calcium signal is biased toward bursts of spikes (Huang et al., 2021). To address this concern, the term “spiking activity” was replaced with “calcium activity” throughout the manuscript. Moreover, the calcium signal was deconvolved to obtain a better estimate of spiking activity (please refer to our response to reviewer 1's point 6).

      Point 5. In the discussion section, the authors focus on the connectivity between the CA1 area and the RSC. Although this is an important point, since the authors are examining peri-ripple cortical dynamics it is critical to discuss other possible connectivity effects. Furthermore, the hippocampal input preferentially targets the granular RSC; how would that impact the results and the authors' interpretation? Additionally, a previous study reported the suppression of thalamic activity during hippocampal ripples (Yang et al., 2019). Importantly, the thalamic inputs to the RSC target the superficial layers. It would add to the value of the paper if the authors expanded the discussion section and elaborated further on possible interpretations of the results.

      At the time of our initial submission, a pre-ripple reduction and post-ripple elevation of calcium activity had been reported in a portion of three subclasses of superficial aRSC inhibitory neurons (Chambers et al., 2022, 2021); this was the basis of our speculation on the potential involvement of feed-forward inhibition in the post-ripple voltage reduction. We speculated that this potential feed-forward inhibition could stem from gRSC excitatory neurons or from other neocortical or subcortical regions projecting to aRSC (please see the discussion section). However, a thalamic source is less likely, because multiple studies have observed the suppression of the majority of thalamic neurons during awake ripples (Logothetis et al., 2012; Nitzan et al., 2022; Yang et al., 2019). Moreover, peri-awake-ripple suppression of thalamic axons projecting to the first layer of aRSC has been reported (Chambers et al., 2022). On the other hand, feedback inhibition could also be involved: excitatory aRSC neurons that are excited by gRSC (as reviewer 1 pointed out) or by any other region, including aRSC itself, could excite aRSC inhibitory neurons, which in turn inhibit pyramidal cells. To address this comment, the following paragraph was added to the discussion section (lines 323-328).

      “Thalamus is another source of axonal projections to aRSC (Van Groen and Wyss, 1992). However, it is less likely that thalamic projections contribute to the peri-awake-ripple aRSC activity modulation, because multiple studies have observed the suppression of the majority of thalamic neurons during awake ripples (Logothetis et al., 2012; Nitzan et al., 2022; Yang et al., 2019). Moreover, peri-awake-ripple suppression of thalamic axons projecting to the first layer of aRSC has been reported (Chambers et al., 2022).”

    1. Author response

      Reviewer #1 (Public Review):

      In their paper, Kroell and Rolfs use a set of sophisticated psychophysical experiments in visually intact observers to show that visual processing at the fovea, within the ~250 ms before a saccade to a peripheral target containing orientation information, is influenced by orientation signals at the target. Their approach straddles the boundary between enforcing fixation throughout stimulus presentation (a standard in the field) and leaving it totally unconstrained. As such, they move the field of saccade pre-processing towards active vision in order to answer key questions about whether the fovea predicts features at the gaze target, over what time frame, with what precision, and over what spatial extent around the foveal center. The results support the notion that there is feature-selective enhancement centered on the center of gaze, rather than on the predictively remapped location of the target. The results further show that this enhancement extends about 3 deg radially from the foveal center and that it starts ~200 ms before saccade onset. They also show that this enhancement is reinforced if the target remains present throughout the saccade. The hypothesized implications of these findings are that they could enable continuity of perception trans-saccadically and, potentially, improve post-saccadic gaze correction.

      Strengths:

      The findings appear solid and are backed up by converging evidence from several experimental manipulations. These included several approaches to overcome current methodological constraints on the critical examination of foveal processing while being careful not to interfere with saccade planning and performance. The authors examined the spatial frequency characteristics of the foveal enhancement as well as hit rates and false alarm rates for detecting a foveal probe that was congruent or incongruent in orientation with the peripheral saccade target, embedded in flickering, dynamic noise (1/f) images. While hit rates are relatively easy to interpret, the authors also reconstructed key features of the background noise to interpret false alarms as reflecting foveal enhancement that could be correlated with target orientation signals. The study also uses appropriate statistical analyses, presented in an extensive Supplementary Materials section, and controls for multiple factors impacting experimental/stimulus design and analysis. The approach, as well as the level of care towards experimental details provided in this manuscript, should prove welcome and useful for other investigators interested in the questions posed.

      Weaknesses:

      I find no major weaknesses in the experiments, analyses or interpretations. The conclusions of the paper appear well supported by the data. My main suggestion would be a clearer discussion of the implications of the present findings for truly naturalistic, visually guided performance and action. Please consider the implications of the phenomena and behaviors reported here when the stimulus at the gaze center (while peripheral targets are present) is not a noisy, relatively feature-poor, low-saliency background, but another high-saliency target, likely crowded by other nearby targets. As such, a key question that emerges, and should at least be addressed in the Discussion, is whether the fovea's role described in the present experiments is restricted to the visual scenarios used here or whether it generalizes to the rather different visual environments of everyday life.

      This is a very interesting question. While we cannot provide a definite answer, we have added a paragraph discussing the role of foveal prediction in more naturalistic visual contexts to the Discussion section (‘Does foveal prediction transfer to other visual features and complex natural environments?’). We reproduce this paragraph in our response to another comment in the ‘Recommendations for the authors’ section below. We suggest that “the pre-saccadic decrease in foveal sensitivity demonstrated previously[9] as well as in our own data (Figure 2B) may boost the relative strength of fed-back signals by reducing the conspicuity of foveal feedforward input”, presumably allowing the foveal prediction mechanism to generalize to more naturalistic environments with salient foveal stimulation.

      Reviewer #2 (Public Review):

      Humans and other primates move their eyes with rapid saccades to reposition the high-resolution region of the retina, the fovea, over objects of interest. Thus, each saccade involves moving the fovea from a pre-saccadic location to a saccade target. Although it has long been known that saccades profoundly alter visual processing at the time of the saccade, scientists simply do not know how the brain combines information across saccades to support our normal perceptual experience. This paper addresses a piece of that puzzle by examining how eye movements affect processing at the fovea before it moves. Using a dynamic noise background and a dual psychophysical task, the authors probe both the performance and selectivity of visual processing for orientation at the fovea in the few hundred milliseconds preceding a saccade. They find that hit rates and false alarm rates are dynamically and automatically modulated by saccade planning. By taking advantage of the specific sequence of noise shown on each trial, they demonstrate that the tuning of foveal processing is affected by the orientation of the saccade target, suggesting fovea-specific feedback.

      A major strength of the paper is the experimental design. The use of dynamic filtered noise to probe perceptual processing is a clever way of measuring the dynamics of selectivity at the fovea during saccade preparation. The use of a dual task allows the authors to evaluate the tuning of foveal processing as well as how it depends on the peripheral target orientation. They show compellingly that the orientation of the saccade target (the future location of the fovea) affects processing at the fovea before it moves.

      There are two weaknesses with the paper in its current form. The first is that the key claim of foveal "enhancement" relies on the tuning of the false alarms. A more standard measure of enhancement would be to look at the sensitivity, or d-prime, of the performance on the task. In this study, hits and false alarms increase together, which is traditionally interpreted as a criterion shift and not an enhancement. However, because of the external noise, false alarms are driven by real signals. The authors are aware of this and argue that the fact that the false alarms are tuned indicates enhancement. But it is unclear to me that a criterion shift wouldn't also explain this tuning and the change in the noise images. For example, in a task with 4 alternative choices (Present/Congruent, Present/Incongruent, Absent/Congruent, Absent/Incongruent), shifting the criterion towards the congruent target would increase hits and false alarms for that target and still result in a tuned template (because that template is presumably what drove the decision variable that the adjusted criterion operates on). I believe this weakness could be addressed with a computational model that shows that a criterion shift on the output of a tuned template cannot produce the pattern of hits and false alarms.

      We thank the reviewer for this comment. We will present three arguments, each of which suggests that our effects are perceptual in nature and cannot be explained by a shift in decision criterion: (1) the temporal specificity of the difference in Hit Rates (HRs), (2) the spatial specificity of the difference in HRs and (3) the phenomenological quality of the foveally predicted signal. In general, a criterion shift would indeed affect hits and false alarms alike. Nonetheless, the difference in HRs only manifested under specific and meaningful conditions:

      First, the increase in congruent as compared to incongruent HRs, i.e., enhancement, was temporally specific: congruent and incongruent HRs were virtually identical when the probe appeared in the baseline time bin or in one (Figure 2B) or even two (Figure 4A) early pre-saccadic time bins. Based on another reviewer’s comment, we collected additional data to measure the time course and extent of foveal enhancement during fixation. While pre-saccadic enhancement developed rapidly, enhancement during fixation only started to emerge 200 ms after target onset. Crucially, these time courses mirror the typical temporal development of visual sensitivity during pre-saccadic attention shifts and covert attentional allocation, respectively[8,33]. We are unaware of data demonstrating similar temporal specificity for a shift in decision criterion. One could argue that a template of the target orientation needs to build up before it can influence the criterion. Nonetheless, this template would be expected to remain effective once this initial temporal threshold has been crossed. In contrast, we observe pronounced enhancement in medium but not late stages of saccade preparation in the PRE-only condition (Figure 4A).

      Second, it has been argued that a defining difference between genuinely perceptual effects and post-perceptual criterion shifts is their spatial specificity[53]: unlike perceptual effects, criterion shifts should manifest in a spatially global fashion. Based on a parafoveal control condition detailed in our reply to the next comment, we maintain the claim that enhancement is spatially specific: congruent HRs exceeded incongruent ones within a confined spatial region around the center of gaze. We did not observe enhancement for probes presented at 3 dva eccentricity even when we raised parafoveal performance to a foveal level by adaptively increasing probe contrast. The accuracy of saccade landing or, more specifically, the mean remapped target location (Figure 3B) influenced the spatial extent of the enhanced region in a fashion that is reconcilable with previous findings[30]. A criterion shift that is spatially and temporally selective, follows the time course of pre-saccadic or covert attention depending on observers’ oculomotor behavior, does not remain effective throughout the entire trial after its onset, is sensitive to the mean remapped target location across trials, and does not apply to parafoveal probes even after their contrast has been increased to match foveal performance would be unprecedented in the literature and, even if it existed, would appear just as functionally meaningful as sensitivity changes occurring under the same conditions.

      Lastly and on a more informal note, we would like to describe a phenomenological percept that was spontaneously reported by 6 out of 7 observers in Experiment 1 and experienced by the author L.M.K. many times. On a small subset of trials, participants in our paradigms have the strong phenomenological impression of perceiving the target in the pre-saccadic center of gaze. This percept is rare but so pronounced that some observers interrupt the experiment to ask which probe orientation they should report if they had perceived two on the same trial (“The orientation of the normal probe or of the one that looked exactly like the target”). Interestingly, the actual saccade target and its foveal equivalent are perceived simultaneously in two spatiotopically separate locations, suggesting that this percept cannot be ascribed to a temporal misjudgment of saccade execution (after which the target would have actually been foveated). We have no data to prove this observation but nonetheless wanted to share it. Experiencing it ourselves has left us with no doubt that the fed-back signal is truly – and almost eerily – perceptual in nature.

      The analysis suggested by the reviewer is very interesting. Yet, for several reasons stated in the ‘Suggestions to the authors’ section, our dataset is not suited to an analysis of noise properties at this level of complexity. We had always planned to resolve these concerns experimentally, i.e., by demonstrating specificity in HRs. We believe that our arguments above provide a strong case for a perceptual phenomenon and have incorporated them into the Discussion of our revised manuscript.
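The reviewer's concern that hits and false alarms rising together signal a criterion shift follows from standard equal-variance signal detection theory (SDT). A minimal sketch, assuming Gaussian internal responses N(0, 1) for noise and N(d', 1) for signal (a textbook simplification, not the analysis used in the manuscript), shows that shifting only the criterion c raises both rates while leaving d' unchanged. It does not implement the four-alternative template model the reviewer proposes; the function names are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def rates_from_sdt(d_prime, criterion):
    """Hit and false-alarm rates under equal-variance Gaussian SDT.

    Noise responses ~ N(0, 1), signal responses ~ N(d_prime, 1). The observer
    reports "present" above the cutoff k = criterion + d_prime / 2, so
    criterion = 0 is unbiased and negative values are liberal.
    """
    cutoff = criterion + d_prime / 2.0
    hit_rate = 1.0 - STANDARD_NORMAL.cdf(cutoff - d_prime)
    false_alarm_rate = 1.0 - STANDARD_NORMAL.cdf(cutoff)
    return hit_rate, false_alarm_rate


def sdt_from_rates(hit_rate, false_alarm_rate):
    """Recover (d_prime, criterion) from observed rates via z-transforms."""
    z_hit = STANDARD_NORMAL.inv_cdf(hit_rate)
    z_fa = STANDARD_NORMAL.inv_cdf(false_alarm_rate)
    return z_hit - z_fa, -(z_hit + z_fa) / 2.0


# Usage: a more liberal criterion raises hits AND false alarms,
# while the d' recovered from the rates stays the same.
neutral = rates_from_sdt(d_prime=1.5, criterion=0.0)
liberal = rates_from_sdt(d_prime=1.5, criterion=-0.5)
print(neutral, liberal)
print(sdt_from_rates(*liberal))  # approximately (1.5, -0.5)
```

This is why the authors' counter-arguments rely on the temporal and spatial specificity of the hit-rate differences rather than on hit rates alone: in this simple model, a congruency-dependent criterion would change both rates without any change in sensitivity.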

      The second weakness is that the authors' claim that feedback is spatially selective to the fovea is confounded by the fact that acuity and contrast sensitivity are higher in the fovea. Therefore, the subject's performance would already be spatially tuned. Even the very central degree, the foveola, is inhomogeneous. Thus, finding spatially tuned sensitivity to the probes may simply indicate global feature gain on top of already spatially tuned processing in the fovea. Another possible explanation that is consistent with the "no enhancement" interpretation is that the fovea has increased. This is consistent with the observation that the congruency effects were aligned to the center of gaze and not the saccade endpoint. It looks from the Gaussian fits as if a single gain parameter would explain the difference in the shape of the congruent and incongruent hit rates, but I could not figure out from the existing methods whether this was explicitly tested. Additional experiments without prepared saccades would be an easy way to address this issue. Is the hit rate tuned when there is no saccade preparation? If so, it seems likely that the spatial selectivity is not tuned feedback, but inhomogeneous feedforward processing.

      We fully agree. We do not consider a fixation condition diagnostic to resolve this question since, as of now, correlates of foveal feedback have exclusively been observed during fixation. In those studies, it was suggested that the effect, i.e., a foveal representation of peripheral stimuli, reflects the automatic preparation of an eye movement that was simply not executed[11,12,14]. To address another reviewer’s comment, we collected additional data in a fixation experiment. The probe stimulus could exclusively appear in the screen center (as in Experiment 1) and observers maintained fixation throughout the trial. While pre-saccadic congruency effects were significantly more pronounced and developed faster, congruency effects did emerge during fixation when the probe appeared 200 ms after the target. If pre-saccadic processes indeed spill over to fixation tasks to some extent and trigger relevant neural mechanisms even when no saccade is executed, we could expect a similar feedback-induced spatial profile during fixation. Since this matches the reviewer’s prediction if the pre-saccadic profiles resulted from inhomogeneous feedforward processing, we do not consider a fixation condition suitable to distinguish between both hypotheses.

      To test whether the tuning of enhancement is effectively a consequence of declining visual performance in the parafovea/periphery, we instead raised parafoveal performance to a foveal level by adaptively increasing the opacity of the probe: while leaving all remaining experimental parameters unchanged, we presented the probe in one of two parafoveal locations, i.e., 3 dva to the left or right of the screen center. Observers were explicitly informed about the placement of the probe. We administered a staircase procedure to determine the probe opacity at which performance for parafoveal target-incongruent probes would be just as high as foveal performance had been in the preceding sessions. While the foveal probe was presented at a median opacity of 28.3±7.6%, a parafoveal opacity of 39.0±11.1% was required to achieve the same performance level. As a result, the gray dot at 0 dva in the figure below represents the incongruent HR in the center of gaze and lies at 80% on the y-axis. The gray dots at ±3 dva represent incongruent parafoveal HRs and also lie at ~80% on the y-axis. Using the reviewer’s terminology, we effectively removed the influence of acuity- (or contrast-sensitivity-) dependent spatial tuning. If the spatial profiles had indeed been the result of “global feature gain on top of already spatially tuned processing”, this manipulation should have rendered parafoveal feature gain just as detectable as foveal feature gain. Instead, congruent and incongruent parafoveal HRs were statistically indistinguishable (away from the saccade target: p = .127, BF10 = 0.531; towards the saccade target: p = .336, BF10 = 0.352), inconsistent with the idea of a spatially global feature gain.

      We had included these data in our initial submission. They were collected in the same observers that contributed the spatial profiles (Experiment 2). The data points at 0 dva in the reduced figure above correspond to the foveal probe location in Figure 2D. The data points at ±3 dva had been plotted and discussed in our initial submission, yet only very briefly. Based on this and another reviewer’s comment, we realize that we should have explained this condition more extensively in the main text rather than in the Methods and have added a dedicated paragraph to the Results section.
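The reviewer's observation that a single gain parameter might account for the difference between the congruent and incongruent spatial profiles could, in principle, be checked with a one-parameter least-squares fit. The sketch below assumes hit rates paired by probe location and a purely multiplicative gain (ignoring ceiling effects as hit rates approach 1); the function name and the example values are hypothetical and not taken from the data.

```python
def fit_single_gain(incongruent, congruent):
    """Least-squares gain g minimizing sum((congruent - g * incongruent) ** 2).

    Returns (g, residual sum of squares). A residual near zero means one
    multiplicative gain reproduces the congruent profile from the incongruent one.
    """
    if len(incongruent) != len(congruent) or not incongruent:
        raise ValueError("profiles must be non-empty and paired by location")
    numerator = sum(c * i for c, i in zip(congruent, incongruent))
    denominator = sum(i * i for i in incongruent)
    gain = numerator / denominator
    residual_ss = sum((c - gain * i) ** 2 for c, i in zip(congruent, incongruent))
    return gain, residual_ss


# Usage with invented hit rates (not data from the study):
incongruent = [0.50, 0.65, 0.80, 0.65, 0.50]
scaled = [1.2 * hr for hr in incongruent]       # a pure multiplicative gain
sharpened = [0.50, 0.70, 0.95, 0.70, 0.50]      # extra gain only near the center
print(fit_single_gain(incongruent, scaled))     # gain close to 1.2, residual close to 0
print(fit_single_gain(incongruent, sharpened))  # clearly non-zero residual
```

A clearly non-zero residual would indicate that the congruent profile is reshaped rather than uniformly scaled; in practice, such a comparison would also need to account for measurement noise in each hit rate.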

      This paper is important because it compellingly demonstrates that visual processing in the fovea anticipates what is coming once the eyes move. The exact form of the modulation remains unclear and the authors could do more to support their interpretations. However, understanding this type of active and predictive processing is a part of the puzzle of how sensory systems work in concert with motor behavior to serve the goals of the organism.

      Reviewer #3 (Public Review):

      This manuscript examines an important and, at the same time, little-investigated question in vision science: what happens to the processing of the foveal input right before the onset of a saccade. This is clearly of relevance, as humans perform saccades about 3 times every second. Whereas what happens to visual perception in the visual periphery at the saccade goal is well characterized, little is known about what happens at the very center of gaze, which represents the future retinal location where the saccade target will be viewed at high resolution upon landing. To address this problem, the authors implemented an elegant experiment in which they probed foveal vision at different times before the onset of the saccade by using a target, with the same or a different orientation with respect to the stimulus at the saccade goal, embedded in dynamic noise. The authors show that foveal processing of the saccade target is initiated before saccade execution, resulting in the visual system being more sensitive to foveal stimuli whose features match those of the stimulus at the saccade goal. According to the authors, this process enables a smooth transition of visual perception before and after the saccade. The experiment is well designed and the results are solid. Overall, I think this work represents a valuable contribution to the field and its results have important implications. My comments are below:

      1. The change in the overall performance between the baseline condition and when the probe is presented after the saccade target is large, but I wonder if there are other unrelated factors that contribute to this difference, for example, simply presenting the probe after vs before the onset of a peripheral stimulus, or the fact that in the baseline the probe is presented right after a fixation marker, but in the other condition there was a longer time interval between the presentation of the marker and the probe transient. The authors should discuss how these confounding factors have been accounted for.

      We thank the reviewer for this helpful comment. We would like to clarify that the probe was never presented right after the fixation dot. In the baseline condition, fixation dot and target were separated by 50 ms, i.e., the duration of one noise image. Since the fixation dot was an order of magnitude smaller than the probe (0.3 vs 3 dva in diameter) and since two large-field visual transients caused by the onset of a new background noise image occurred between fixation dot disappearance and probe appearance, we consider it unlikely that the performance difference was caused by any kind of stimulus interaction such as masking. Nonetheless, we had been puzzled by this difference already when inspecting preliminary results and wondered if it may reflect observers’ temporal expectations about the trial sequence. We therefore explicitly instructed and repeatedly reminded observers that the probe could appear before the peripheral target. Since the difference persisted, we ascribed it to a predictive remapping of attention to the fovea during saccade preparation, as we had stated in the Discussion.

      Another contributing factor may be that observers approached the oculomotor and perceptual detection tasks sequentially. In early trial phases, they may have prioritized localizing the target and programming the eye movement. After motor planning had been initiated, resources may have been freed up for the foveal detection task. Since the probe appeared after the saccade target on the majority of probe-present trials, this strategy would have been mostly adaptive. Crucially, however, observers achieved similar incongruent Hit Rates in the baseline and last pre-saccadic time bin (70% vs 74%). While we observed pronounced enhancement in the last pre-saccadic bin, congruent and incongruent Hit Rates in the baseline bin were virtually identical. We therefore conclude that lower overall performance in the baseline bin did not prevent congruency effects from occurring. Instead, congruency effects started developing only after target appearance. We have added this potential explanation to the Results.

      1. Somewhat related to point 3, the authors conclude that the effects reported here are the result of saccade preparation/execution; however, a control condition in which the saccade is not performed is missing. This leaves me wondering whether the effect is only present during saccade preparation or whether it may also be present, to some extent or to its full extent, when covert attention is engaged, i.e., when subjects perform the same task without making a saccade.

      Foveal feedback has, as of now, exclusively been demonstrated during fixation (see references in Introduction and Discussion). In most of these studies, it was suggested that these effects (i.e., the foveal representation of a peripheral stimulus) may reflect the automatic preparation of an eye movement that was simply not executed[11,12,14]. Since foveal feedback has been demonstrated during fixation, and since eye movement preparation may influence foveal processing even when the eyes remain stationary, we considered it likely that congruency effects would emerge during fixation. Nonetheless, we agree with the reviewer that an explicit comparison between saccade preparation and fixation would enrich our data set and allow for stronger conclusions. We therefore collected additional data from seven observers. While all remaining experimental parameters were identical to Experiment 1, observers maintained fixation throughout each trial. We found that pre-saccadic foveal enhancement was more pronounced and emerged earlier than foveal enhancement during fixation. We present these data in the Results section (Figure 5) and have updated the Methods section to incorporate this additional experiment. We have furthermore added a paragraph to the Discussion which addresses potential mechanisms of foveal enhancement during fixation and saccade preparation.

      Furthermore, the reviewer’s comment helped us realize that we never stated a crucial part of our motivation explicitly. We now do so in the Introduction:

      “Despite the theoretical usefulness of such a mechanism, there are reasons to assume that foveal feedback may break down while an eye movement is prepared to a different visual field location. First and foremost, saccade preparation is accompanied by an obligatory shift of attention to the saccade target[6-8], which in turn has been shown to decrease foveal sensitivity[9]. Moreover, the execution of a rapid eye movement induces brief motion signals on the retina[20] which may mask or in other ways interfere with the pre-saccadic prediction signal. On a more conceptual level, the recruitment of foveal processing as an ‘active blackboard’[21] may become obsolete in the face of an imminent foveation of relevant peripheral stimuli – unless, of course, foveal processing serves the establishment of trans-saccadic visual continuity.”

      We believe that the additional data and the revisions to the Introduction and Discussion have strengthened our manuscript and thank the reviewer for this comment.

      1. Unlike other tasks addressing pre-saccadic perception in the literature, here subjects do not have to discriminate the peripheral stimulus at the saccade goal, and most processing resources are presumably focused on the foveal location. Could this have influenced the results reported here?

      This is true. We intentionally made the features of the peripheral target as task-irrelevant as possible, unlike previous investigations. We wanted to ensure that the enhancement we find would be automatic and not induced by a peripheral discrimination task, as we state in the Discussion and the Methods. We agree that the foveal detection task likely focused processing resources on the center of gaze in Experiment 1. In Experiment 2, however, we measured the spatial profile of enhancement, which involved two different conditions:

      1. In each observer’s first six sessions, the probe could be presented anywhere on a horizontal axis of 9 dva length. On a given trial, an observer could not predict where it would appear, and therefore could not strategically allocate their attention. Nonetheless, enhancement of target-congruent orientation information was tuned to the fovea.
      2. In the final, seventh session, the probe appeared exclusively in one of two possible peripheral locations: 3 dva to the left or 3 dva to the right of the screen center. Observers were explicitly informed that the probe would never appear foveally, and processing resources should therefore have been allocated to the peripheral probe locations. The general performance level in this condition was comparable to performance in the fovea (see reply to the next comment). Nonetheless, we did not find peripheral enhancement of target-congruent information.

      Importantly, the magnitude of the foveal congruency effect in the PRE-only condition of Experiment 1 (i.e., when the target disappeared before the eyes landed on it) was comparable to the foveal congruency effect in Experiment 2 (PRE-only throughout), suggesting that the format of the task – i.e., purely foveal detection or foveal and peripheral detection – did not alter our findings.

      1. The spatial profile of the enhancement is very interesting and it clearly shows that the enhancement is limited to a central region. To which extent this profile is influenced by the fact that the probe was presented at larger eccentricities and therefore was less visible at 4.5 deg than it was at 0 deg? According to the caption, when the probe was presented more eccentrically the performance was raised to a foveal level by adaptively increasing probe transparency. This is not clear, was this done separately based on performance at baseline? Does this mean that the contrast of the stimulus was different for the points at +- 3 dva but the performance was comparable at baseline? Please explain.

      Based on the previous comment and comments of Reviewer #2, we realize that we should have explained this condition more extensively in the main text rather than in the Methods and have adapted the manuscript accordingly. As stated in our reply to the previous comment, Experiment 2 involved one session in which we addressed whether the lack of parafoveal/peripheral enhancement could be due to a simple decrease in acuity, as mentioned by the reviewer. Observers were explicitly informed that the to-be-detected stimulus (the probe) would appear either 3 dva to the left or right but never in the screen center and were shown slowed-down example trials for illustration. Observers then performed a staircase procedure aimed at determining the probe contrast at which performance for parafoveal target-incongruent probes would be just as high as foveal performance for target-incongruent probes had been in the previous six sessions. While the foveal probe was presented at a median opacity of 28.3±7.6%, an opacity of 39.0±11.1% was required to achieve the same performance level at a 3 dva eccentricity. Therefore, the gray curve in Figure 2D that represents incongruent Hits reaches its peak just under 80% on the y-axis. The gray dots at ±3 dva also lie at ~80% on the y-axis. The performance level for target-incongruent probes (‘baseline’ here) in the parafovea is thus equal to foveal performance for target-incongruent probes. Target-congruent parafoveal feature information had the same “chance” to be enhanced as foveal information in the preceding sessions. Despite equal performance levels, we found no parafoveal enhancement. This suggests that enhancement is a true consequence of visual field location and not simply mediated by visual acuity at that location.
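The response does not specify which staircase rule was used. As a hypothetical illustration of how a staircase can converge on a target hit rate such as ~80%, the sketch below swaps in a weighted up-down rule (Kaernbach, 1991): opacity decreases by a small step after each hit and increases by a step four times larger after each miss, so the expected change is zero where P(hit) = 0.8. The logistic observer, its threshold and slope, the step size and the starting opacity are all invented for illustration and are not the authors' parameters.

```python
import math
import random


def weighted_up_down_step(opacity, hit, target_rate=0.8, step_down=0.01,
                          lower=0.0, upper=1.0):
    """One update of a weighted up-down staircase (Kaernbach, 1991).

    After a hit, opacity falls by step_down; after a miss, it rises by
    step_up = step_down * target_rate / (1 - target_rate). The expected change
    is zero where P(hit) == target_rate, so the track hovers around that level.
    """
    step_up = step_down * target_rate / (1.0 - target_rate)
    new_opacity = opacity - step_down if hit else opacity + step_up
    return min(upper, max(lower, new_opacity))


def simulate_staircase(n_trials, start=0.5, threshold=0.35, slope=0.05, seed=1):
    """Run the staircase against a hypothetical logistic observer.

    Returns (overall hit rate, list of opacities after each trial).
    """
    rng = random.Random(seed)
    opacity = start
    hits = 0
    track = []
    for _ in range(n_trials):
        p_hit = 1.0 / (1.0 + math.exp(-(opacity - threshold) / slope))
        hit = rng.random() < p_hit
        hits += hit
        opacity = weighted_up_down_step(opacity, hit)
        track.append(opacity)
    return hits / n_trials, track


# Usage: the overall hit rate settles near 0.8, and late opacities hover
# around the (hypothetical) observer's 80% point, 0.35 + 0.05 * ln(4).
hit_rate, track = simulate_staircase(10_000)
late_mean = sum(track[-5000:]) / 5000
print(round(hit_rate, 3), round(late_mean, 3))
```

Because every hit and miss moves the track by a fixed amount, the proportion of hits over a long run is pinned close to the target rate as long as the opacity stays within its bounds.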

      1. The enhancement is significant within a region of 6.4 dva around the center of gaze. This is a rather large region, especially considering that it extends also in the direction opposite to the saccade. I was expecting the enhancement to be more confined to the central foveal region. Was the effect shown in Figure 2D influenced by the fact that saccades in this task were characterized by a large undershoot (Fig 1 D)? Did the effect change if only saccades landing closer to the target were included in the analysis? There may not be enough data for resolving the time course, but maybe there are differences in the size of the main effect.

      Width of the profile: In general, the width of the enhancement profile is likely to be influenced by two experimental/analysis choices: the size of the probe stimulus presented during the experiment and the width of the moving window combining adjacent probe locations for analysis.

      Probe size: Since the probe itself had a comparatively large diameter of 3 dva, even the leftmost significant point at -2.6 dva could be explained by an enhancement of the foveal portion of the probe. We had mentioned this briefly in the Discussion but realize that this point is crucial and should be made more explicit.

      Moving window width: We designed the experiment with the intention to densely sample a range of spatial locations during data collection and to combine a certain number of adjacent locations using a moving window during analysis (see preregistration: https://osf.io/6s24m). To ensure the reliability of every data point, the width of this window was chosen based on how many trials were lost during preprocessing. We chose a window width of 7 locations, as this ensured that each data point contained at least 30 trials on an individual-observer level. Nonetheless, the width of the resulting enhancement profile depends on the width of the moving window:
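      To make the dependence on window width concrete, the pooling step can be sketched as a box-car sum over adjacent probe locations. This is a hypothetical illustration: `hits` and `trials` stand for per-location counts for one observer, edge locations that cannot host a full window are simply dropped (`mode="valid"`, an assumption about edge handling), and `smallest_window` mimics the rule of picking the narrowest window that guarantees at least 30 trials per pooled data point.

```python
import numpy as np


def pooled_hit_rate(hits, trials, window=7):
    """Pool hit and trial counts over `window` adjacent probe locations.

    Returns the pooled hit rate and pooled trial count for every position at
    which a full window fits (edge locations are dropped: mode="valid").
    """
    kernel = np.ones(window)
    pooled_hits = np.convolve(hits, kernel, mode="valid")
    pooled_trials = np.convolve(trials, kernel, mode="valid")
    return pooled_hits / pooled_trials, pooled_trials


def smallest_window(trials, min_trials=30, max_window=15):
    """Narrowest window giving every pooled point at least `min_trials` trials."""
    trials = np.asarray(trials)
    for window in range(1, min(max_window, len(trials)) + 1):
        _, pooled = pooled_hit_rate(np.zeros_like(trials), trials, window)
        if pooled.min() >= min_trials:
            return window
    return None  # no window up to max_window satisfies the criterion


# Toy observer: 21 probe locations with 5 valid trials each
print(smallest_window(np.full(21, 5)))
```

      Because every pooled point averages `window` neighbouring locations, a wider window smooths the profile and can spread a narrow peak over more positions, which is why the absolute profile width should be read with caution.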

      We added these caveats to the Results section and incorporated the figure above into the Supplements. We now state explicitly that…

      “the main conclusions that can be drawn are that enhancement i) peaks in the center of gaze, ii) is not uniform throughout the tested spatial range as, for instance, global feature-based attention would predict, and iii) is asymmetrical, extending further towards the saccade target than away from it.”

      For the above reasons, the absolute width of the profile should be interpreted with caution.

      Saccadic landing accuracy: To address the reviewer’s question, we inspected the spatial enhancement profile separately for trials in which the saccade landed on the target (i.e., within a radius of 1.5 dva from its center) or off-target but still within the accepted landing area. Besides being meaningful, this criterion ensured that all observers contributed trials to every data point. We did not resolve the time course in this experiment and therefore could not collapse across time points as suggested by the reviewer. To increase the number of trials per data point, we instead increased the width of the moving window sliding across locations from 6 to 9 neighboring locations (but see caveat above).
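      The accurate/inaccurate split amounts to thresholding each saccade's landing error. A minimal sketch, assuming landing and target coordinates in dva and using the 1.5 dva radius given above; the target position and landing points in the example are made up:

```python
import numpy as np

ON_TARGET_RADIUS_DVA = 1.5  # radius stated in the reply


def accurate_mask(landing_xy, target_xy, radius=ON_TARGET_RADIUS_DVA):
    """True where the saccade landed within `radius` dva of the target center."""
    landing = np.asarray(landing_xy, dtype=float)
    error = np.linalg.norm(landing - np.asarray(target_xy, dtype=float), axis=1)
    return error <= radius


# Made-up landing points for a target assumed at (8, 0) dva;
# saccades outside the accepted landing area are assumed already excluded.
landings = [[8.0, 0.2], [6.2, 0.0], [7.0, 0.5]]
mask = accurate_mask(landings, (8.0, 0.0))
print(mask)  # first and third saccades count as 'accurate'
```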

      Considering only saccades that landed on the target (‘accurate’; A) yielded significant enhancement from -2.6 to 2.1 dva and from 3.2 dva throughout the measured range towards the saccade target. Saccades that landed off-target (‘inaccurate’; B) showed a more pronounced asymmetry. When only considering inaccurate saccades, enhancement reached significance between -1.1 and 4.4 dva.

      The increased asymmetry for inaccurate saccades may be related to predictive remapping: since inaccurate saccades were hypometric on average, the predictively remapped location of the target was shifted towards the target by the magnitude of the undershoot. Asymmetric enhancement would therefore have boosted congruency at the remapped target location across all trials. Consequently, we examined whether aligning probe locations to the remapped target location on an individual-trial level would lead to a narrower profile for inaccurate saccades. This was not the case. Instead, we observed two parafoveal maxima (C). Their positions on the x-axis equal the mean remapping-dependent leftward (2.0 dva) and rightward (1.9 dva) displacement across trials. In other words, they correspond to the pre-saccadic center of gaze. Note that these profiles could not be fitted with a mixture of Gaussians and were fitted using polynomials instead.
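      For illustration only, the trial-wise realignment and the polynomial fallback can be sketched as below. Here `shift` is a hypothetical per-trial displacement of the reference location (its sign convention is an assumption), the polynomial degree is arbitrary, and NumPy's `Polynomial.fit` performs an ordinary least-squares fit; none of this is taken from the authors' analysis code.

```python
import numpy as np


def realign(probe_x, shift):
    """Express probe positions relative to a per-trial reference location.

    `shift` is a hypothetical per-trial displacement (dva) of that reference,
    e.g. derived from the saccade landing error; its sign convention is assumed.
    """
    return np.asarray(probe_x, dtype=float) - np.asarray(shift, dtype=float)


def fit_profile(x, enhancement, degree=4):
    """Least-squares polynomial fit of an enhancement profile (callable result)."""
    return np.polynomial.Polynomial.fit(x, enhancement, deg=degree)


# Toy profile with two maxima, which a single Gaussian could not capture
x = np.linspace(-4.0, 4.0, 17)
profile = -0.02 * x**4 + 0.2 * x**2 + 1.0
poly = fit_profile(x, profile)
grid = np.linspace(-4.0, 4.0, 801)
print(grid[np.argmax(poly(grid))])  # location of one of the two maxima
```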

      In sum, while we do not observe a clear narrowing of the enhancement profile for accurate saccades, the profile’s asymmetry is more pronounced for inaccurate eye movements. An increase in asymmetry could bear functional advantages since it would boost congruency at the remapped target location across all trials. Importantly though, this adjustment seems to rely on an estimate of average rather than single-trial saccade characteristics: aligning probe locations to the remapped attentional locus on an individual trial level provides further evidence that, irrespective of individual saccade endpoints, enhancement was aligned to the fovea. We have added these analyses to the Results section (Figure 3). We have also added the remapped profiles for all saccades and accurate saccades only to the Supplements.

      1. Is the size of the enhanced region around the center of gaze related to the precision of saccades? Presumably, if saccades are less precise a larger enhanced area may be more beneficial.

      This is a very interesting point. To address this question, we estimated each observer’s saccadic precision by computing bivariate kernel densities from their saccade landing coordinates. As we measured the horizontal extent of enhancement in our experiment, we used the horizontal bandwidth as an estimate of saccadic imprecision. To estimate the size of the enhanced region for each observer, we created 10,000 bootstrap samples of each observer’s congruent and incongruent HRs (4 locations combined at each step). We then determined the difference between the bootstrapped congruent and incongruent HRs and defined significantly enhanced locations as all locations for which <= 5% of these differences fell below zero. Finally, we defined the width of the enhancement profile as the maximum number of consecutive significant locations.
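      Both estimates can be sketched in a few lines. This is a hedged illustration, not the authors' code: the bandwidth rule is assumed to be Scott's rule (the default of `scipy.stats.gaussian_kde`), the inputs are hypothetical already-pooled hit and trial counts per location, and resampling 0/1 trial outcomes with replacement is drawn as binomial counts at the observed hit rate, which is statistically equivalent.

```python
import numpy as np


def horizontal_bandwidth(landing_xy):
    """Horizontal kernel SD of a 2-D Gaussian KDE under Scott's rule.

    Matches sqrt(scipy.stats.gaussian_kde(xy.T).covariance[0, 0]) with
    SciPy's default bandwidth; the rule actually used is an assumption.
    """
    xy = np.asarray(landing_xy, dtype=float)
    n, d = xy.shape
    scott_factor = n ** (-1.0 / (d + 4))
    return scott_factor * xy[:, 0].std(ddof=1)


def enhanced_width(cong_hits, cong_trials, incong_hits, incong_trials,
                   n_boot=10_000, alpha=0.05, seed=0):
    """Longest run of consecutive locations with significant enhancement."""
    rng = np.random.default_rng(seed)
    significant = []
    for ch, ct, ih, it in zip(cong_hits, cong_trials, incong_hits, incong_trials):
        boot_cong = rng.binomial(ct, ch / ct, size=n_boot) / ct
        boot_incong = rng.binomial(it, ih / it, size=n_boot) / it
        diff = boot_cong - boot_incong
        # significant if at most alpha of the bootstrapped differences are < 0
        significant.append(np.mean(diff < 0) <= alpha)
    longest = run = 0
    for is_sig in significant:
        run = run + 1 if is_sig else 0
        longest = max(longest, run)
    return longest


# Toy observer: clear enhancement at the three central locations only
width = enhanced_width([15, 15, 28, 28, 28, 15], [30] * 6,
                       [15, 15, 10, 10, 10, 15], [30] * 6)
print(width)
```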

      Instead of a positive correlation, we observed a negative correlation between the bandwidth of landing coordinates (i.e., saccadic imprecision) and the size of the enhanced window (r = -.56, p = .117). In other words, there was a non-significant tendency that the less precise an observer’s saccades, the narrower their estimated region of enhancement. We furthermore inspected the magnitude of enhancement per position within the enhanced region. To do so, we computed the mean difference between congruent and incongruent HR across all positions in the enhanced region. The sizes of the orange circles in the figure above represent the resulting values (ranging from 2.9% to 13.3%). As saccadic precision decreases, the magnitude of enhancement per data point in the enhanced region tends to decrease as well. We therefore suggest that high saccadic precision is a sign of efficient oculomotor programming, which in turn allows peri-saccadic perceptual processes to operate more effectively. We added this analysis to the Supplements and refer to it in the Results section of the revised manuscript.

    1. Author Response

      Reviewer #1 (Public Review):

      HCN channels are atypically opened by the downward movement of gating charges during hyperpolarisation and show weak coupling between the VSD and pore domain; in the absence of an open-state structure, extracting mechanistic information has been difficult. This manuscript is a continuation of a previous study on HCN channel gating that revealed how hyperpolarisation causes a downward movement of the VSD's S4, with breakage into two helices. The authors explore gating motions and the coupling between the VSD and the pore domain using atomistic simulations. This includes microsecond MD with and without very strong -1 V applied potentials to try to drive VSD-TMD changes to open the channel. In the end, however, the authors used a biased simulation approach (adiabatic bias) to enforce the conformational change from the resting state to an open homology model of HCN based on hERG/rEAG. This microsecond simulation followed three interaction distances that free MD suggested change between resting and open states. This simulation caused pore opening and allowed a description of changes that may occur during gating, including a competition of S5-S6 and S6-S6 contacts and lipid binding locations, which may suggest lipid-dependent function and explain an unexpected closed structure at 0 mV in micelles. While I feel the manuscript is written for the HCN expert audience, the mechanistic information on hyperpolarisation-induced voltage gating makes it of much interest. The manuscript is presented at a high level, though there are a couple of points to address, including reproducibility of simulations and potential for more relation to experimental findings.

      Thank you for these comments; please find our detailed answers below.

      The authors carried out 1 μs MD simulations of the resting and activated states and of a Y289D mutant at 0 mV, and then tried to drive the conformational change with a very large -1 V voltage (double that studied previously). In 1 μs MD, is the membrane stable with such a large voltage, as it would likely not be experimentally? Even with 1 V applied, there was incomplete activation of the voltage sensors, despite timescales approaching that of activation.

      This reviewer is correct in cautioning against membrane rupturing effects in simulations with a voltage of this magnitude. We have indeed checked that the membrane and the protein remain intact under these conditions and can confirm that no poration occurs. As membrane poration is stochastic, it could occur over microsecond timescales under 1 V, but the probability remains low, and fortunately we did not encounter it here. Note that whereas potentials of this magnitude could not be applied in experiments, they are fairly routinely used in MD simulations to speed up processes that are driven by changes in transmembrane potential.
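      The reply does not say how the potential was imposed. In the widely used constant-electric-field approach, a transmembrane potential ΔV is produced by a uniform field E = ΔV / L_z, where L_z is the length of the periodic simulation box normal to the membrane. A minimal sketch, assuming that scheme and a hypothetical 15 nm box height (not taken from the paper):

```python
def field_for_potential(delta_v, box_z_nm):
    """Uniform field (V/nm) imposing a potential drop delta_v (V) across a box of height box_z_nm (nm)."""
    return delta_v / box_z_nm


BOX_Z_NM = 15.0  # hypothetical box height, for illustration only
for delta_v in (-0.5, -1.0):
    print(f"{delta_v:+.1f} V -> {field_for_potential(delta_v, BOX_Z_NM):+.4f} V/nm")
```

      Because the potential drop concentrates across the low-dielectric membrane, the same ΔV corresponds to a weaker applied field in a taller box, and doubling ΔV at fixed L_z doubles the field.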

      Interestingly, other work from our lab (Rems et al. Biophysical Journal 119 (1) 190-205 (2020)) has shown that HCN1 voltage sensor domains are less prone to poration than other voltage sensor domains, for reasons that remain to be determined.

      Author Response Figure 1. Final snapshots from the simulations of the resting (blue), intermediate (yellow) and activated (red) states. The solvent (water + ions), shown in cyan, reveals no membrane poration at the end of the 1 μs simulations.

      For the pulling/ driving simulations (adiabatic bias MD) to change suspected interaction distances (V390-I302, N300-W281, and D290-K412), it seems to be just 1 simulation, without reproducibility. One has to wonder, if the simulation was redone from a very different initial conformation, would the results be the same (in addition to the distances themselves that were enforced by the ABMD). Moreover, the authors had to model the open state, such that the results depend on a homology model based on other CNBD channels, hERG / rEAG. Although the model stayed open for a microsecond, what other measures of accuracy of the homology model are there, such as preserved distances according to mutants/double mutants?

      The ABMD simulations were repeated, please refer to the response to essential revisions point 1 for details.

      For reasons mentioned by the reviewer as well as a reconsideration of our strategy to model channel opening, we have decided to omit homology models from the revised version of the paper.
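      For readers unfamiliar with adiabatic bias MD, the bias in its basic form (as implemented, for example, by the ABMD action in PLUMED) acts like a ratchet on a collective variable s: with ρ(t) = (s(t) - s_target)² and ρ_m(t) the smallest ρ reached so far, V = (κ/2)(ρ - ρ_m)² when ρ > ρ_m and V = 0 otherwise. Progress toward the target is therefore free and only backtracking is penalised. The sketch below implements that single-CV form in plain Python; the class name, parameter values and toy trajectory are illustrative and are not the authors' simulation input.

```python
from dataclasses import dataclass


@dataclass
class AdiabaticBias:
    """Single-CV ratchet bias: V = kappa/2 * (rho - rho_min)^2 if rho > rho_min."""
    target: float
    kappa: float
    rho_min: float = float("inf")

    def rho(self, s):
        return (s - self.target) ** 2

    def update(self, s):
        """Ratchet step: remember the closest approach to the target so far."""
        self.rho_min = min(self.rho_min, self.rho(s))

    def energy(self, s):
        excess = self.rho(s) - self.rho_min
        return 0.5 * self.kappa * excess ** 2 if excess > 0 else 0.0

    def force(self, s):
        """Generalized force -dV/ds; zero unless the CV has moved back."""
        excess = self.rho(s) - self.rho_min
        if excess <= 0:
            return 0.0
        return -self.kappa * excess * 2.0 * (s - self.target)


# Toy CV trajectory (arbitrary units) approaching, backtracking from, then nearing target 2.0
bias = AdiabaticBias(target=2.0, kappa=1.0)
for s in [0.0, 1.0, 0.5, 1.5]:
    bias.update(s)
    print(s, bias.energy(s))
```

      With several biased distances, one such ratchet per collective variable is the simplest assumption; how the authors combined their collective variables is described in their response to essential revisions point 1, not here.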

      The authors find that activation involves hydrophobic forces that strengthen the intra-subunit S4/S5/S6 interface, as well as lipid headgroups that make contact with hydrophilic residues at this interface, with lipid tails also contributing to hydrophobic contacts. The authors see bending and rotation of the lower S4 and a displacement of S1 away from S4 that exposes the VSD-pore interface to lipids, with increased lipid contacts at S4 and S5 during activation. This indicates lipid tails may play a role in coupling in HCN1 and may explain the closed state micelle structure at 0 mV. Two sites of lipid contact are identified, one engaging VSD residues and the other polar or charged residues on S5 and S6. No experiments are presented or proposed to test the predicted lipid sites. For example, mutation of key residues, such as the arginine and histidine seen binding lipid headgroups, could be tested as proof of their involvement, or perhaps experiments with varied phosphate moieties could be performed. In the absence of new experiments, is there existing data that could help validate the findings?

      We thank the reviewer for this comment. As noted in the response to essential revisions point 3, such experiments are challenging and have not been reported so far for HCN channels. We agree that aspects of the mechanism we propose remain hypothetical pending further work, but are happy to report that the importance of lipid interactions with the crucial salt bridge pair mentioned in the response to essential revisions point 3 has been independently validated, thus substantially strengthening our mechanistic hypothesis.

      During free MD simulation, the authors see tilting of S5 upon activation of the Y289D mutant that brings D290 and K412 into proximity. How do we know that the adjacent Y289D mutation has not caused this, or was this interaction also seen in the wild-type simulation? Fig. 3c might suggest that the WT activated simulation shows such an interaction, but it is unclear given the large Cα distances, as opposed to H-bonding distances.

      Indeed, Figure 3 appears to indicate that this interaction between D290 and K412 is present in the activated state when the mutation is reverted to the WT sequence. We have recalculated the interaction propensity using all atoms of the residues and present an updated Figure 3c in response.
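      The difference between a Cα-based and an all-atom measure can be made concrete with a small sketch. It assumes an interaction propensity defined as the fraction of frames in which the closest atom pair of the two residues lies within a cutoff; the 4 Å cutoff, function name and toy coordinates are hypothetical, and periodic boundary corrections are ignored (coordinates assumed whole/unwrapped).

```python
import numpy as np


def contact_propensity(coords_a, coords_b, cutoff=4.0):
    """Fraction of frames in which any atom of residue A lies within `cutoff` of any atom of residue B.

    coords_a: (n_frames, n_atoms_a, 3); coords_b: (n_frames, n_atoms_b, 3); same units as cutoff.
    """
    a = np.asarray(coords_a, dtype=float)[:, :, None, :]
    b = np.asarray(coords_b, dtype=float)[:, None, :, :]
    pair_dist = np.linalg.norm(a - b, axis=-1)  # (frames, atoms_a, atoms_b)
    closest = pair_dist.min(axis=(1, 2))        # closest atom pair per frame
    return float(np.mean(closest <= cutoff))


# Two toy frames; atom 0 of each residue plays the role of C-alpha.
res_a = np.array([[[0, 0, 0], [1, 0, 0]]] * 2, dtype=float)
res_b = np.array([[[4.5, 0, 0], [10, 0, 0]], [[6, 0, 0], [10, 0, 0]]], dtype=float)
print(contact_propensity(res_a, res_b))                # all atoms: 0.5
print(contact_propensity(res_a[:, :1], res_b[:, :1]))  # C-alpha only: 0.0
```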

      The authors predict that a D290-K412 salt bridge may be important for gating and sought to experimentally validate the interaction in the activated-open state using cysteine cross-bridging. As this is the only experimental backing in the paper, it is important to be able to judge its ability to report on the D290-K412 salt bridge. A comparison experiment demonstrating other crosslinks that do not favour the open state would have been helpful in this regard; for example, if cross-bridging at similar locations (but not predicted to change interaction during gating) had little effect on I/Imax, then the result may be bolstered. Are there existing mutagenesis experiments that may suggest the importance of these residues (as well as for other key interaction distances identified)?

      Negative results in cross-bridging and cysteine accessibility studies are in general difficult to interpret, as the lack of a cadmium-specific effect may be due to inaccessibility of the site to cadmium, a pairwise distance too large to be bridged by cadmium, or bridging of the specified site without a functional effect. However, as reviewer 2 pointed out below, the Yellen group has performed extensive cross-bridging experiments in the S4-S5 to C-linker region in spHCN, and at most of these positions, the pairs favoring the open state are closer together in our models than pairs favoring the closed state or those without functional effect. We have added Videos 1-6 to highlight this comparison on our open state models and describe it in our updated Discussion section.

      Rotation of the V390 side chain from a position facing the pore lumen to a position facing I302 on S5 is coupled to an increase of the pore radius at V390, an increased hydration of the pore intracellular gate, and K+ ion movement. Perhaps 5 or 6 ions cross in that single simulation. As K+ channel ion permeation can depend critically on the starting ion configuration (as well as on the model/force field), reproducibility of this finding is important but does not appear to have been tested. How can we be sure that periods of permeation or no permeation in individual simulations are reliable?

      As mentioned in our response to essential revisions point 1, we have modified the collective variable set used in ABMD, and repeated the simulations in 4 replicates. Whereas the number of permeation events is low in each simulation (Figure 4 S1), the consistency across repeats indicates that these open pore models indeed represent conductive states. Given how short the simulations are, however, it appears unreasonable to infer conductance values from these observations.
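      Counting permeation events is usually done by tracking each ion's position along the pore axis. A minimal sketch, assuming wrapped z coordinates, a hypothetical slab [z_lo, z_hi] spanning the pore, and frames saved often enough that an ion is seen inside the slab during a passage; an event is counted only when an ion enters from one side and leaves through the other, so jumps across the periodic boundary are not miscounted.

```python
def count_crossings(z, z_lo, z_hi):
    """Count complete passages of one ion through the slab [z_lo, z_hi].

    Returns (upward, downward) event counts for a single ion's z trajectory.
    """
    up = down = 0
    prev = None
    entered_from = None  # side from which the ion last entered the slab
    for zi in z:
        side = "below" if zi < z_lo else "above" if zi > z_hi else "inside"
        if side == "inside" and prev in ("below", "above"):
            entered_from = prev  # ion just entered the slab
        elif side != "inside" and prev == "inside":
            if entered_from == "below" and side == "above":
                up += 1
            elif entered_from == "above" and side == "below":
                down += 1
            entered_from = None  # ion left the slab
        prev = side
    return up, down


# Toy trajectory (Å): one upward and one downward passage, a periodic jump
# (40 -> -40) that never enters the slab, and an aborted entry from below.
z_traj = [-20, -5, 0, 5, 20, 5, -15, 40, -40, -5, -15]
print(count_crossings(z_traj, z_lo=-10, z_hi=10))
```

      Summing the counts over all ions and replicas gives the per-simulation event numbers; with so few events, the authors' caution against deriving conductance values applies.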

      Reviewer #3 (Public Review):

      In this work, Elbahnsi and colleagues use enhanced-sampling MD simulations to recapitulate, step by step, the electromechanical coupling between the VSD and the pore in HCN1 channels. Building on the available cryoEM structures of HCN1 with the VSD in resting and active states, the authors characterize by MD a subset of interactions that seemingly stabilize the open channel. This subset is, in turn, used in enhanced-sampling simulations to guide channel opening. The main findings are that S4 movement induces a rearrangement of the hydrophobic interactions at the S1, S4 and S5 interfaces. Lipid occupancy therefore seems state-dependent, highlighting a regulatory role of lipids in HCN gating.

      The approach is rather innovative, and it apparently allows the reconstruction of the whole gating mechanism, pushing the predictive power of MD simulation well beyond its actual temporal limitations. At the same time, the initial choice of interactions is crucial for this approach, because the result cannot differ from the inputs. From the paper, it does not emerge clearly how the correctness of the reconstructed gating pathway can be verified, other than by functional validation.

      We thank the reviewer for this thoughtful review. It has pushed us to reconsider our approach to enhance the sampling of channel activation and gating. Please refer to the detailed response below as well as the response in particular to essential revisions point 1.

      Here are my comments on the main interactions that were used to feed the final MD simulation:

      1) W281-N300: this interaction, previously identified and studied in spHCN channels (Ramentol et al, 2020; Wu et al, 2021), has been elegantly confirmed in this paper. Its inclusion in the initial subset seems appropriate. In the other two cases, the choice of interactions requires further explanation and experimental validation.

      2) D290 and K412: the validation of this interaction shown in Figure 3 and Supplementary Figure 1 is missing a control, i.e., the effect of adding Cd2+ to the wild-type channel. Please add.

      We have performed the control suggested. Please also refer to the answer to essential revisions point 2.

      3) Modelling the open state of HCN1 pore (page 18), is done on the structure of the distantly related hERG rather than on the available open pore structure of HCN4. This choice is justified as follows by the authors:

      a) "Available structures in the CNBD channel family for which representative structures have been solved in closed and open states".

      b) "The structural mechanism of pore gating (i.e. the ⍺ to 𝜋 helix occurring at the glycine657 hinge in hERG) observed in rEAG/hERG may be a conserved gating transition in the CNBD family of channels"

      I encourage the authors to consider the following:

      a) The structure of the hERG channel is not available in both closed and open configurations; indeed, the comparison must be made with the closed configuration of the related channel rEAG. In contrast, HCN4 is available in both closed and open configurations. Moreover, one of the open pore structures shows S4-S5-S6 in a very similar conformation to the locked-open mutant (F186C/S264C) of HCN1 (Saponaro et al, 2021). With an open HCN4 structure available, forcing HCN1 into the open pore structure of the hERG channel (which opens upon depolarization and is not regulated by cAMP) seems unnecessary.

      In response to this point, we reconsidered our approach and chose instead to use a biasing distance that is consistently increased in the resolved structures of CNBD channels: the distance between V390 residues of neighboring and opposing subunits. We have detailed our rationale in the response to essential revisions point 1.
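      How these V390 distances were defined (Cα or side-chain atoms) and combined into the biased collective variable is not stated in this reply. A minimal sketch, assuming one Cα atom per subunit ordered around the pore and simple averaging of the adjacent (neighboring) and diagonal (opposing) pairs; the function name and coordinates are hypothetical:

```python
import numpy as np


def v390_ring_distances(ca_xyz):
    """Mean adjacent and diagonal distances between the four V390 atoms.

    ca_xyz: (4, 3) coordinates, one per subunit, ordered around the pore (A, B, C, D).
    """
    ca = np.asarray(ca_xyz, dtype=float)
    adjacent = [np.linalg.norm(ca[i] - ca[(i + 1) % 4]) for i in range(4)]
    diagonal = [np.linalg.norm(ca[0] - ca[2]), np.linalg.norm(ca[1] - ca[3])]
    return float(np.mean(adjacent)), float(np.mean(diagonal))


# Idealized four-fold symmetric ring of radius 5 Å around the pore axis
r = 5.0
ring = [[r, 0, 0], [0, r, 0], [-r, 0, 0], [0, -r, 0]]
print(v390_ring_distances(ring))
```

      For a perfectly symmetric tetramer the diagonal distance is √2 times the adjacent one, so both grow together as the gate dilates; asymmetric opening would show up as a mismatch between the two.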

      To my knowledge, hERG is the only channel of the CNBD family for which the ⍺-to-𝜋 helix transition reported by the Authors occurs in S6. It is not reported for other CNBD family members, in particular for the CNG channels mentioned by the Authors (Zheng et al., 2020; Xue et al., 2021, 2022). TAX-4 (Zheng et al.) does not show it: its pore opens by a right-handed twist of S6 at glycine 399, a glycine conserved in all CNG channels. Human CNGA1, too, opens the pore by a rotational movement of S6 hinged at the equivalent glycine (glycine 385) (Xue et al., 2021), and the same occurs in the non-symmetrical CNGA1/B1 channel (Xue et al., 2022). So it seems that CNG channels do not show the ⍺-to-𝜋 helix transition in the open pore. Moreover, hERG excluded, all other members of the CNBD family, CNG, EAG, and HCN4 included, do not bend at the hinge glycine 657 of hERG, but at another glycine (Gly648 in hERG numbering) located upstream. Further, their opening is due to a rotation of S6 associated with an outward movement, rather than to the lifting of the lower part of S6, as in hERG.

      After considering this reviewer’s comment, we were surprised to see that HCN1 is apparently prone to secondary structure deformation in S6, even when biasing the aforementioned distances and thus enforcing no rotation at all in S6. We are intrigued by this observation and eagerly await its experimental validation or refutation.

      In the meantime, we have made clear in the text that this hypothesis remains based exclusively on modeling work.

      4) V390-I302: this interaction is predicted to stabilize the open pore configuration and was included in the subset. The contact between V390 on S6 and I302 on S5 is observed in the homology model discussed above when S6 is twisted at the glycine hinge, rotating the preceding residue (V390) out of its pore-lining position. Again, I can only disagree with this hypothesis, because it has been experimentally demonstrated (Cheng et al, J Pharmacol Exp Ther. 2007 Sep;322(3):931-9) that the side chain of valine 390 is inside the cavity of the open pore of HCN1 channels, as it controls the affinity for the pore blocker ZD7288.

      In accordance with other comments above, we have eliminated the bias applied to the V390-I302 distance. However, the new ABMD simulations, with bias applied to encourage dilation at position 390, still involve rotation of V390 away from the central pore axis, albeit with bending of S6 at the upper glycine mentioned by this reviewer. The degree of rotation is lower than in our previous simulations, so that V390 still lines the inner vestibule in the open state, consistent with the observation that this position influences the apparent affinity of open pore blockers.

      In conclusion, modelling the open state pore of HCN1 on hERG rather than on HCN4 does not seem justified based on accumulated evidence in the published literature. Therefore, the authors' choice to use it as the open pore model of HCN1 channels needs to be experimentally validated. One possibility is to mutate the glycine hinge, Gly391 in HCN1, to alanine in order to remove the flexible hinge. If this mutation alters pore gating, it will support the choice of the Authors.

      Once more, we thank the reviewer for the comments, which have led us to reconsider a large part of our modeling work.

    1. Author Response

      Reviewer #2 (Public Review):

      There is emerging evidence that connexin43 hemichannels localized to mitochondria can influence their function. Here the authors demonstrated using an osteocyte cell model that connexin43 is localized to mitochondria and that this is enhanced in response to oxidative stress. Several lines of evidence were presented showing that mitochondrial connexin43 forms functional hemichannels and that connexin43 is required for optimal mitochondrial respiration and ATP generation. These aspects were major strengths of the study.

      The authors also show that connexin43 is recruited to mitochondria in response to oxidant stress, as a cell protective mechanism. This was primarily done using hydrogen peroxide to generate oxidant stress; primary osteocytes from Csf-1+/- mice, which are prone to Nox4 induced oxidant stress, also show enhanced mitochondrial connexin43 when compared with wild type osteocytes.

      Several approaches were used to demonstrate that connexin43 interacts with the ATP synthase subunit, ATP5J2, suggesting a direct role for connexin43 in the control of ATP synthesis by mediating mitochondrial ion homeostasis. Several experiments were done using a series of pHluorin fusion protein constructs as a proton sensor, these experiments hint at a potential role for connexin43 in regulating H+ permeability to support ATP production. However, the effects of inhibiting connexin43 on pH were modest, suggesting that additional roles for mitochondrial connexin43 in ATP generation should be considered.

      Thank you for your positive and thoughtful comments. We agree that additional roles for mitochondrial Cx43 may be possible. As an example, we consider that there may be a change in the stability of ATP synthase that occurs after mtCx43 deficiency. This and other possible roles of mtCx43 ought to be investigated in the future.

      Reviewer #3 (Public Review):

      This manuscript should be of broad interest to readers not only in the field of gap junction (GJ) mediated cell-to-cell communication but also to scientists and clinicians working on the function of mitochondria and metabolism. Their data elucidate a new function of Cx43 in regulating the energy (ATP) generation of mitochondria, e.g., under oxidative stress.

      The canonical function of gap junctions is in direct cell-to-cell communication by forming plasma membrane traversing channels that electrically and chemically connect the cytoplasms of adjacent cells. These channels are assembled from connexin proteins, such as connexin 43 (Cx43). More recently, however, new non-canonical cellular locations and functions of Cx43 have been discovered, e.g. mitochondrial Cx43 (mtCx43). Still, very little is known about where Cx43 transported into mitochondria is derived from, how Cx43 is transported into mitochondria, where it is located in mitochondria, in which form Cx43 is present in mitochondria (polypeptides, hemi-channels (HCs), or complete GJ channels), and what the function of mtCx43 is. The authors addressed the latter question. The authors provide convincing evidence that mtCx43 modulates mitochondrial homeostasis and function in bone osteocytes under oxidative stress. Together, their study suggests that mtCx43 hemi-channels regulate mitochondrial ATP generation by mediating K+, H+, and ATP transfer across the mitochondrial inner membrane through direct interaction with mitochondrial ATP synthase (ATP5J2), leading to enhanced protection of osteocytes against oxidative insult. These findings provide important information on a role of Cx43 functioning directly in mitochondria rather than at its canonical location in the plasma membrane. While most of the functional assays presented in Figures 2-8 appear solid, the mitochondrial localization of Cx43, its translocation into mitochondria under oxidative stress, and its configuration as hemi-channels (Figure 1) are less convincing. I have five general comments that should be addressed:

      1) This study was performed in MLO-Y4 osteocyte cells. Is the H2O2-induced increase in mitochondrial Cx43 specific to MLO-Y4 cells or to osteocytes, or does Cx43 play a more general role in mitochondrial function, e.g. under oxidative stress? Osteoblasts such as MC3T3-E1 and MG63, and many other cell types, endogenously express Cx43, and oxidative stress is a general physiological stressor, not only for osteocytes and bone cells. Addressing this question would establish the generality of the findings for mitochondrial function.

      We thank the reviewer for bringing up these valid points; seeing the phenotype displayed in other cell types, such as osteoblasts, would be of great relevance and interest. To address this, we conducted new experiments on MC3T3-E1 cells (Figure 1-figure supplement 2). After 2 h of H2O2 treatment, Cx43 accumulated on the mitochondria, marked by MitoTracker. Statistical analysis also showed a significant increase in the colocalization of Cx43 and MitoTracker (Figure 1-figure supplement 2B). The colocalization coefficient is higher in the Ctrl group of MC3T3-E1 cells than in the MLO-Y4 Ctrl group, indicating a different response level in other cell lines; osteoblasts seemed to be more sensitive to redox interference. Overall, these results suggest that under oxidative stress, mtCx43 may display a similar phenotype across multiple cell lines, although the degree of sensitivity may differ.

      2) The images of MLO-Y4 cells (Figure 1A) and the primary osteocytes isolated from Csf-1+/- and control mice (Figure 8) do not show visible gap junctions. I guess this is due to the fact that slides were stained with the Cx43(E2) antibody. I feel that additionally staining these cells with the Cx43(CT) antibody would be helpful to get a better understanding of the distribution of Cx43 in gap junctions and of undocked/un-oligomerized Cx43 in these cells.

      Thank you for the suggestion. To get a better understanding of the distribution of Cx43, either in GJ or HC form, we performed additional experiments in MLO-Y4 cells using the Cx43(CT) antibody; the data are shown below. With Cx43(CT) staining, we observed more signal in the cells and on the plasma membrane. After H2O2 treatment, we observed increased and stronger signals localized on the mitochondria compared with the untreated control group. The stronger signals observed at the plasma membrane indicate gap junctions stained by the Cx43(CT) antibody.

      3) The images of cells presented in Figure 1A are quite fuzzy. No mitochondria are visible, and the Cx43 staining is hazy and does not localize to any subcellular structures. Also, it is not clear if the higher resolution image presented in Figure 1C actually represents a mitochondrion. A good DIC image, or co-staining with another mitochondrial marker such as MitoTracker (as shown in Figure 4-S1), would make the localization and translocation of Cx43 into mitochondria upon oxidative stress more convincing. This is especially important as the translocation, although statistically significant, increases only by about 10% or less (Figure 1B). Such a small difference (also represented in the Western analyses presented in Figure 1D) could easily be artefactual, depending on how the correlation coefficient was generated. Of note in this respect is that control cells in Figure 1A appear larger (compare the size of the nuclei) and are more spread out than the H2O2 treated cells. Better, clearer images would make the mitochondrial localization/translocation more convincing.

      The reviewer made great points. To improve the image clarity, we redid the staining/imaging and determined the colocalization of SDHA and MitoTracker Deepred. The result (shown below) suggested that under normal conditions without H2O2 treatment, SDHA and MitoTracker merged perfectly, while after H2O2 treatment for 2 h, mitochondria became fragmented and the SDHA signal exhibited a more dotted pattern than the MitoTracker signal. Overall, we feel that MitoTracker represents the distribution of mitochondria better. SDHA is a subunit of mitochondrial complex II, and the images we presented in Figure 1C were captured from isolated mitochondria under a confocal microscope with SDHA and Cx43(CT) co-staining. Considering the specificity of SDHA (see images below), we believe the Cx43 signal we captured demonstrates mitochondrial localization/translocation. After using MitoTracker as a mitochondrial marker and higher-magnification images, the correlation coefficient increased from 0.35 to 0.47, a 32% increase with statistical significance. As to nuclear size, some cells are indeed smaller, which may be affected by varying local cell density. The nuclei in the new images presented in Figure 1A are much more consistent in size.
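      Neither reply names the coefficient used; assuming it is Pearson's correlation coefficient between the Cx43 and MitoTracker channel intensities, it can be sketched as below. The function name and the optional mask (for example, restricting the analysis to one cell) are illustrative assumptions, and real colocalization pipelines often add background subtraction or thresholding, which this sketch omits.

```python
import numpy as np


def pearson_colocalization(channel_1, channel_2, mask=None):
    """Pearson's correlation coefficient between two image channels.

    channel_1, channel_2: same-shape intensity arrays (e.g. Cx43 and MitoTracker).
    mask: optional boolean array restricting the analysis (e.g. to one cell).
    """
    a = np.asarray(channel_1, dtype=float)
    b = np.asarray(channel_2, dtype=float)
    if mask is not None:
        a, b = a[mask], b[mask]
    a = a.ravel() - a.mean()
    b = b.ravel() - b.mean()
    return float(np.sum(a * b) / np.sqrt(np.sum(a * a) * np.sum(b * b)))


# Toy 4x4 "images": a linear transform of a channel is perfectly correlated with it
img = np.arange(16, dtype=float).reshape(4, 4)
print(pearson_colocalization(img, 2 * img + 3))
```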

      4) How pure are the mitochondria that were probed for Cx43 by Western blot in Figure 1D? The preparation method described is relatively simple, collecting the 10,000 × g supernatant (here, the 9,000 × g supernatant) as the mitochondrial fraction. Is it possible that the Cx43 signal is derived, at least in part, from other contaminating membranes, such as the PM, Golgi, or ER? Testing the mitochondrial preparation by Western blot with marker proteins specific for these compartments would strengthen the authors' results.

      The reviewer made a great suggestion. To address it, we performed a Western blot to test mitochondrial purity. As expected for this simple centrifugation method, there was some contamination with ER (marked by PDI) and Golgi (marked by STX6). To further assess the purity of the mitochondrial fraction, we used fluorescent dyes for mitochondria (MitoTracker Deep Red), ER (ER-Tracker Blue-White), and nuclei (Hoechst). The organelle-specific dyes indicated that most of the fraction consisted of mitochondria, with some ER fragments and minimal nuclear contamination. Combining our Western blot and immunofluorescence data, we conclude that our Cx43 signal is derived primarily from mitochondria.

      5) The authors rely on previous studies to postulate that Cx43 in mitochondria forms hemichannels in their system, is localized in the inner membrane, and is oriented with the Cx43 C-terminus facing the intermembrane space (as schematized in Figure 8C). The authors use lucifer yellow (LY) dye transfer and carbenoxolone, but neither is a hemichannel-specific probe: LY is also transferred by GJ channels, and carbenoxolone also blocks them. Experiments using hemichannel-specific probes would be more convincing. This is important, as the cited information is based on only two references (Boengler et al., 2009; Miro-Casas et al., 2009), and it is still highly unclear how a membrane protein that is co-translationally inserted into the ER membrane and then traffics through the Golgi to the plasma membrane is actually imported into mitochondria, and in which state (monomeric or hexameric). It is also neither described nor clear why the Cx43(CT)-specific antibody traverses the outer mitochondrial membrane and reaches the Cx43 CT while the Cx43(E2)-specific antibody does not. Were these mitochondria permeabilized with Triton X-100, as described in the Methods?

      We edited the Methods section. We did not use Triton X-100 to permeabilize mitochondria. PMP appeared to preserve the integrity of the mitochondrial inner membrane, allowing us to assess the localization of the Cx43(CT) antibody on mitochondria. We show these new immunofluorescence images in Figure 5-figure supplement 2. PMP, used as a plasma membrane permeabilizer, has a sixfold higher affinity for the mitochondrial outer membrane (MOM) than for the inner membrane (MIM). Meanwhile, no Cx43(E2) antibody signal was detected in mitochondria, suggesting that the extracellular loop of Cx43 faces the matrix and cannot be accessed by the Cx43(E2) antibody.

      The translocation of Cx43 to mitochondria has been reported to involve the chaperone Hsp90-dependent TOM complex pathway (Rodriguez-Sinovas et al., 2006). Whether mtCx43 forms gap junctions in mitochondria after translocation is unclear. Lucifer yellow is widely used for both hemichannel-mediated dye uptake and gap junction-mediated dye transfer. In our case, given the channel orientation, mtCx43 should form hemichannels, and the Cx43(CT) antibody could be used as a specific Cx43 hemichannel (HC) blocker, as reported in cardiomyocytes (Lillo et al., 2019).

    1. Author Response

      Reviewer #3 (Public Review):

      The authors showed that D2R antagonism did not affect the initial dip amplitude but shortened the duration of the dip and the rebound ACh levels. In addition, using both ACh and DA sensors, the authors showed that DA levels correlate with ACh dip length and rebound level, not with dip amplitude. Both pieces of evidence support their conclusion that DA does not evoke the dip but controls the overall shape of the ACh dip. Overall, the current study provides solid data and interpretation. The combination of a D2R antagonist and CIN-specific Drd2 KO further supports a causal relationship between DA and the ACh dip. The experiments are well designed and carefully conducted, and the manuscript is well written.

      At the behavioral level, the authors found a positive correlation between total AUC (of the ACh signal dip) and press latency in Figure 10, indicating that cholinergic levels contribute to motivation. The next logical experiment would be to compare press latency between control and ChAT-Drd2KO mice, since KO mice have a smaller AUC while DA is unaffected. However, this information is missing from the manuscript. The authors instead showed that the correlation between AUC and latency was disrupted, which is only indirectly related to the conclusion and hard to interpret. Figure 10 shows that eticlopride lengthens press latency in a dose-dependent manner. However, it is not clear what press latency means and how it was measured in this CRF task (since there is no initial cue in the CRF test, how is press latency defined?).

      We did compare press latency between control and ChATDrd2KO mice (Figure 10B). At baseline (saline), there is no difference in press latency between these two groups. We measured press latency as the time to press the lever after it has been extended. When the lever extends, it makes a sound (cue), which signals to the mice that a new trial has started. The fact that press latency is not enhanced in ChATDrd2KO mice was surprising to us. It is possibly due to compensation via other neuronal mechanisms that regulate press latency (see our response to comment 6 of the public review).

      A Pearson r < 0.5 is normally considered a weak correlation. It would be better to state the r values and discuss them in the manuscript.

      A valid comment. We clarified our correlation analyses in the methods section (line 717):

      “We used a variance-explained statistical analysis (R²) to determine the percentage of variance in our correlation analyses (for example, a correlation of 0.5 means that 0.5² × 100 = 25% of the variance in Y is “explained” or predicted by the X variable). When comparing correlation values, Fisher’s transformation was used to convert Pearson correlation coefficients to z-scores.”

      We also added this to the Results section, e.g., line 256: “which accounts for 22% of the variance in the ACh decrease explained by the DA peak.”
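      As a worked illustration of the two quantities described in the quoted Methods text, the sketch below computes the percentage of variance explained (r² × 100) and compares two independent Pearson correlations using Fisher's z-transformation. It assumes that the two correlations come from independent samples, each with n > 3; the function names and example numbers are hypothetical. For instance, 22% of variance explained corresponds to r ≈ 0.47.

```python
import math

def variance_explained(r: float) -> float:
    """Percentage of the variance in Y explained by X for a Pearson correlation r."""
    return 100.0 * r * r

def compare_correlations(r1: float, n1: int, r2: float, n2: int) -> tuple[float, float]:
    """Two-sided test for a difference between two independent Pearson correlations.

    Each r is converted with Fisher's transformation z = atanh(r); the difference
    of the z values is divided by its standard error sqrt(1/(n1-3) + 1/(n2-3)).
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

print(variance_explained(0.5))       # 25.0
print(round(math.sqrt(0.22), 2))     # 0.47, the r that explains 22% of the variance
z, p = compare_correlations(0.60, 40, 0.30, 40)  # hypothetical r values and sample sizes
print(round(z, 2), round(p, 3))
```

The z-test is appropriate only for correlations from independent groups (e.g., two genotypes); correlations measured in the same animals need a dependent-correlation test instead.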

      Is there any correlation between ACh AUC and other behavior indexes such as press speed or the time between press and reward licking?

      We don’t have the ability to measure press speed and there is no press rate because the lever retracts after the first lever press. We quantified the correlation between time to press until head entry (press to reward latency) and ACh AUC and the results are difficult to interpret. For Drd2f/fl control mice we determined a weak negative correlation (the larger the ACh dip the lower the press to reward latency). In contrast, in ChATDrd2KO mice we found a weak positive correlation between ACh AUC and press to reward latency (the smaller the dip, the lower the press to reward latency). Given these conflicting results, it is difficult to determine how the ACh AUC affects press to reward latency.

      In Figure 2B (CS+ group), the authors focused on the responses at the CS+; however, the ACh dips at reward delivery seem to persist even afterwards in this particular example. This might be an interesting phenomenon in which ACh becomes dissociated from DA signals, which warrants further analysis by the authors.

      We see a persistent signal at reward delivery in both DA and ACh throughout the 8 days of testing. However, 1 mouse lost its optical fiber for the GACh signal, so the data from Days 6-8 are from 2 mice. We also measured the correlation between DA and ACh at reward delivery for all 8 days of testing (see below). The correlation data are variable, with the strongest correlation observed on Day 2. These signals could become dissociated after even more days of testing, but we do not have these data.

    1. Author Response

      Reviewer #2 (Public Review):

      In general, the study has several novel findings, the experimental design is appropriate, and the manuscript is well written. While the manuscript contains a lot of data, it is still somewhat descriptive. There are also some issues that should be addressed.

      1) In Figure 1E, the authors demonstrate a small but significant decrease in the body weight of mutant mice. The difference is not drastic. They also mention that some mice showed kyphosis. Please provide data on the percentage of mutant mice that showed kyphosis. Please also provide individual hind limb muscle weights normalized to body weight.

      Thank you for your suggestions. Kyphosis was observed in some (more than one-third) of the Dst-b mutant mice, as shown in Author response image 1. However, MRI or CT imaging of the skeleton is necessary to diagnose kyphosis accurately, and such imaging was not performed in this study. Therefore, we prefer not to provide data on the percentage of mutant mice showing kyphosis.

      We weighed the soleus muscle of the hind limb and present the data (lines 132-135).

      2) There is a lot of variability in the age of the mice used in this study. For example, in Figure 3, the authors mention 23-month-old mice (Fig. 3A) as well as mice over 20 months and over 18 months old. What was the exact age of the mice? Why were mice of three different ages used for the same set of experiments? The authors should also comment on whether the onset of myopathy in skeletal and cardiac muscle occurs at the same or different ages in mutant mice.

      As suggested, we now state the exact ages in each figure legend. The variability in age arises because we performed many different kinds of experiments at different time points. We describe that myopathy phenotypes occurred at around 16 months of age and older (lines 128-129). As for cardiomyopathy, fibrosis was observed at around 16 months of age and older (Figure 3D,E).

      3) Authors have studied protein aggregation only in the soleus muscle of mutant mice. Do the same types of aggregates also form in cardiomyocytes? They write that desmin aggregates were observed in cardiomyocytes of mutant mice. Please show those results in a supplemental figure.

      As suggested, we now present the data on desmin aggregates in the cardiomyocytes of Dst-bE2610Ter/E2610Ter mice (Figure 4-figure supplement 1).

      4) In Figure 5, the authors suggest that mutant mice have mitochondrial abnormalities. However, this analysis is quite abstract and inconclusive. Immunohistochemical images show higher levels of CytoC and Tom20, whereas qRT-PCR demonstrates a significant decrease in mRNA levels of some mitochondria-related molecules. The authors should perform additional experiments to determine whether there is any difference in mitochondrial content between WT and mutant mice. In addition, they should perform functional assays (e.g., OCR measurements such as a Seahorse experiment) to determine whether mitochondrial oxidative phosphorylation capacity is affected in mutant mice.

      Thank you very much for the comment. Mitochondrial accumulation is a characteristic phenotype of Dst-bE2610Ter/E2610Ter muscle and also of other types of MFM. We performed quantitative analyses and added the data (Figure 5B). Mitochondrial accumulation was observed even at a young age, when protein aggregates were not observed (Figure 3-figure supplement 1A). As the reviewer pointed out, it is important to demonstrate changes in mitochondrial function, but we do not currently have that assay system; we would like to present these data, including an analysis of mitophagy, in a future paper.

      5) The morphology of the mitochondria in TEM images shows features that are commonly observed during oxidative damage. Is there any evidence of oxidative stress in skeletal and cardiac muscle of mutant mice?

      Thank you very much for the insightful comment. Gene Ontology and KEGG pathway analyses of the RNA-seq data did not indicate alterations in oxidative stress in the heart. We also performed qPCR for genes associated with oxidative stress in the soleus (Figure 1-figure supplement 3), which did not show alterations in oxidative stress. We would like to investigate this point in the future.

      Reviewer #3 (Public Review):

      This manuscript by Yoshioka et al. provides an extensive analysis of cardiac and skeletal muscle in a mouse model of Dst-b mutation. The authors have generated the mutant mouse model to selectively mutate Dst-b isoform of Dystonin and show that such a mutation leads to cardiomyopathy and late-onset myofibrillar myopathy. This is a novel discovery which adds valuable information to the genetic basis and molecular mechanism of MFM mediated by Dst-b. However, the manuscript needs substantial revision and additional feasible experiments.

      In Figure 3A, the authors suggest that there are smaller myofibers in the mutant mice; however, they do not provide enough data to support this. Cross-sectional areas in the mutant and WT should be measured and represented as binned distributions. This would better show the presence of smaller myofibers and muscle degeneration in the mutant mice.

      Thank you for the helpful comment. We quantified the distribution of cross-sectional area (CSA) in the soleus and present the data in Figure 3C, which indicates that there are smaller myofibers in the mutant mice.
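      The binned CSA representation requested by the reviewer can be sketched as follows. The bin width (250 µm²), the upper limit (5,000 µm²), and the synthetic WT and mutant areas are hypothetical choices, not values from the study. Each genotype's histogram is expressed as a percentage of its own fibers so that groups with different fiber counts can be compared directly.

```python
import numpy as np

def csa_distribution(areas_um2, bin_width=250.0, max_area=5000.0):
    """Return bin edges and the percentage of myofibers in each CSA bin.

    areas_um2: cross-sectional areas of individual myofibers (square micrometres).
    Fibers larger than max_area are counted in the last bin so none are lost.
    """
    areas = np.clip(np.asarray(areas_um2, dtype=float), 0.0, max_area)
    edges = np.arange(0.0, max_area + bin_width, bin_width)
    counts, _ = np.histogram(areas, bins=edges)
    return edges, 100.0 * counts / counts.sum()

# Hypothetical CSA values; real values would come from segmented soleus sections.
rng = np.random.default_rng(1)
wt = rng.normal(2500.0, 600.0, size=400)
mutant = rng.normal(2000.0, 700.0, size=400)
edges, wt_pct = csa_distribution(wt)
_, mut_pct = csa_distribution(mutant)
small = edges[:-1] < 1500.0  # bins whose lower edge is below 1500 um^2
print(f"fibers < 1500 um^2: WT {wt_pct[small].sum():.1f}%, mutant {mut_pct[small].sum():.1f}%")
```

Using the same fixed bin edges for both genotypes keeps the two distributions directly comparable, bin by bin.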

      In Figure 3A-B, the authors show that mutant mice have significantly more myofibers with centrally located myonuclei indicating the constant degeneration and regeneration in the mutant mice. Another indicator of this is the number of activated muscle stem cells. Under homeostasis, authors can compare the number of quiescent muscle stem cells and activated muscle stem cells. If there is constant degeneration and regeneration in the mutant muscle, there will be more cycling muscle stem cells and that will further prove such phenotype in question. Alternatively, they can use EdU water and quantify the number of EdU+/Pax7+ cells between the mutant and WT.

      Thank you very much for this interesting comment. We agree that muscle regeneration in Dst-b mutant mice is an interesting subject. We tried to address this issue by making ISH probes for Pax7 and Emerin, which label muscle stem cells (image below). However, we were unable to reach a conclusion at this time. We intend to address this issue in the future.

      In figure 2F, the authors show behavioral tests on the mutant mice of age 1 year. They do not show any significant difference in muscle strength. However, most of the myopathic phenotypes they observe are at 23 months of age, these behavioral tests can be repeated at that age to see if there is more muscle weakness in the mutant mice compared to the WT. Also, are these behavioral test readouts affected by the cardiomyopathy independent of skeletal muscle strength?

      We have used the rotarod test and wire hang test to evaluate motor coordination and have reported impaired motor performance in dt mice (Horie et al., 2020). The purpose of these behavioral tests in the present study was to evaluate the motor coordination of Dst-b mutant mice compared with dt mice, not to assess skeletal muscle function. The text has been changed to clarify this point (lines 121-123).

      Generally speaking, these behavioral tests, especially the rotarod test, may be affected by cardiac abnormalities. However, it is difficult to draw conclusions from the results of this study, since there were no significant differences in the behavioral experiments.

      They show in Figure 3B that the number of CNFs is affected to different extents in different muscles. These muscles differ in myofiber composition, one consisting mostly of slow-type fibers and the other mostly of fast-type fibers. Whether the Dst-b mutation affects muscle fiber types is not clear. Is there a difference?

      Thank you very much for the insightful comment. We performed qPCR to evaluate whether the Dst-b mutation affects the myofiber type of the soleus muscle (Figure 1-figure supplement 3B). Expression levels of the fiber-type marker genes did not differ between WT and Dst-b mutant mice.

      The cardiac myopathy phenotype that is clearly shown in figure 3 is shown in mice of 16 months of age whereas the skeletal muscle myopathy phenotype is shown in 23-month-old mice. The reason for the choice of the age of the mice should be discussed. Does the cardiac phenotype precede the skeletal muscle phenotype? Have they looked at the skeletal muscle phenotype at earlier ages? If so, that data should be provided as well and discussed.

      Thank you for the comment. We analyzed myopathy and cardiomyopathy phenotypes in mice aged 16-23 months and chose high-quality histological images for presentation. As shown in Figure 3B, CNFs increased in the soleus of all Dst-b mutant mice aged 16-23 months. We added a statement that skeletal myopathy phenotypes occur in 16-month-old mice.

      The authors clearly show the formation of protein aggregates in the myofibers of the mutant mice. They further characterize the composition of these desmin aggregates by identifying co-aggregating proteins such as plectin and αB-crystallin. Another Z-disk component that has been shown to be involved in the aggregates in MFM is myotilin. The authors should also show the presence or absence of this protein, and its co-aggregation with the desmin aggregates, in the mutant mice.

      As suggested, we performed immunohistochemistry for myotilin. Myotilin was abnormally accumulated in myofibers of the soleus from Dst-b mutant mice. We thank the reviewer for this helpful comment and have added the data to Figure 4-figure supplement 2.

      The authors show abnormal accumulation of mitochondria through cyt c and Tom20 staining. The increased Tom20 levels in the mutant are shown in figure 5A which is from mice that are 23-month-old. However, in figure 3-figure supplement 1a they also show elevated Tom20 staining in the mutant mice that are only 1-2 months old. However, no other phenotype is observed at this age except for the disrupted mitochondria according to the data provided. This needs to be discussed and addressed.

      We would like to clarify that the data in Figure 3-figure supplement 1A are from 3-4-month-old mutant mice. These data show that mitochondrial accumulation precedes CNF and desmin aggregation. We have described this point in the text (lines 206-209).

      In Figure 5, the authors show changes in gene expression levels of genes involved in oxidative phosphorylation which supports the disrupted mitochondrial function. Additionally, ROS levels could be compared between the WT and mutant mice.

      To address the involvement of oxidative stress, we performed qPCR for genes associated with the oxidative stress response in the soleus (Figure 1-figure supplement 3C). The qPCR data did not show alterations in these genes. We would like to investigate this point in the future.

      In Figure 5 authors show disrupted oxidative phosphorylation in the mutant soleus muscle. Is this also associated with the fiber-type switch? Since mouse soleus muscle is a mix of fast and slow fiber types, they can look at differences in gene expression of key marker genes for slow and fast myofibers.

      Thank you very much for the suggestion. We quantified the expression levels of muscle fiber-type marker genes (Figure 1-figure supplement 3B). The data do not suggest a fiber-type switch.

      In figure 2, the authors show that mutant mice increase their body weight at a normal pace until 13 weeks of age after which the mutant mice become lighter than their WT counterparts. Is this suggestive of loss of muscle mass? If so, the authors show the muscle atrophy phenotype in 23-month-old mice with cross-sections. Does this mean muscle atrophy starts at an earlier age at 16 months in these mice? Please provide details on the age of the mice for each experiment. In addition, in the text (line 121) authors phrase that the mutant mice become leaner. Lean usually means a decrease in fat mass and an increase in muscle mass. Is this the case? If so, there is no data to support that and the phenotype in the mutant mice suggests there is muscle atrophy in these mice. Therefore, it would not be appropriate to suggest that these mice get lean. However, it is interesting that the bodyweight of the mutant mice gets significantly lighter after 13 weeks. EchoMRI analysis can be performed between these mice to see the total body composition to determine if there is a change in the different type of fat, lean or water composition.

      Thank you for your comments. We now provide exact ages in each figure legend. We describe that skeletal myopathy phenotypes occur as early as 16 months of age, and CSA analysis showed an increase in small-caliber myofibers in the soleus of Dst-b mutant mice. However, soleus muscle mass normalized to body weight was not significantly different between control and Dst-bE2610Ter/E2610Ter mice. Therefore, the muscle atrophy may not be severe enough to affect muscle weight.

      Because we have not quantified the fat mass in Dst-b mutant mice, we changed the phrase from “the mutant mice become leaner” to “they become lower body weight compare with WT mice” (line 120).

      The authors performed RNA-seq on the left ventricle of mutant and WT mice. Separate clustering of WT and mutant samples should be shown, at least in a PCA plot. IGV tracks showing expression changes in key genes between the mutant and WT should also be shown. In addition, they could show how genes involved in autophagy and protein degradation are affected, since these are the main mechanisms underlying protein aggregation in MFMs.

      Thank you for your helpful comment. We performed principal component analysis (PCA) and hierarchical clustering. The data showed that the transcriptomic features of WT and Dst-b mutant hearts are separated (Figure 8-figure supplement 1A, B). To evaluate changes in gene expression levels, we also performed real-time PCR (Figure 8-figure supplement 1C). Our Gene Ontology and KEGG pathway analyses of the RNA-seq data from the heart did not suggest alterations in autophagy and protein degradation, whereas many genes involved in the unfolded protein response were affected (Figure 8C, Figure 8-figure supplement 1C). Previous studies have reported that the unfolded protein response is abnormal in several animal models of myofibrillar myopathy (Winter et al., 2014; Fang et al., J Clin Invest, 2017). We would like to investigate the mechanisms underlying protein aggregation in Dst-b mutant myofibers in the future.

    1. Author response:

      Reviewer #1 (Public Review):

      This paper proposes a novel framework for explaining patterns of generalization of force field learning to novel limb configurations. The paper considers three potential coordinate systems: cartesian, joint-based, and object-based. The authors propose a model in which the forces predicted under these different coordinate frames are combined according to the expected variability of produced forces. The authors show, across a range of changes in arm configurations, that the generalization of a specific force field is quite well accounted for by the model.

      The paper is well-written and the experimental data are very clear. The patterns of generalization exhibited by participants - the key aspect of the behavior that the model seeks to explain - are clear and consistent across participants. The paper clearly illustrates the importance of considering multiple coordinate frames for generalization, building on previous work by Berniker and colleagues (JNeurophys, 2014). The specific model proposed in this paper is parsimonious, but there remain a number of questions about its conceptual premises and the extent to which its predictions improve upon alternative models.

      A major concern is with the model's premise. It is loosely inspired by cue integration theory but is really proposed in a fairly ad hoc manner, and not really concretely founded on firm underlying principles. It's by no means clear that the logic from cue integration can be extrapolated to the case of combining different possible patterns of generalization. I think there may in fact be a fundamental problem in treating this control problem as a cue-integration problem. In classic cue integration theory, the various cues are assumed to be independent observations of a single underlying variable. In this generalization setting, however, the different generalization patterns are NOT independent; if one is true, then the others must inevitably not be. For this reason, I don't believe that the proposed model can really be thought of as a normative or rational model (hence why I describe it as 'ad hoc'). That's not to say it may not ultimately be correct, but I think the conceptual justification for the model needs to be laid out much more clearly, rather than simply by alluding to cue-integration theory and using terms like 'reliability' throughout.

      We thank the reviewer for raising this point. We see and treat the problem of finding the combination weights not as a cue integration problem but as an inverse optimal control problem. In this case, there can be several solutions to the same problem (i.e., which forces are expected in untrained areas), and these solutions can coexist and give the motor system the option to switch between or combine them. This is similar to other inverse optimal control problems, e.g., combining feedforward optimal control models to explain simple reaching. However, whereas those studies fit the weights between different models, we propose an explanation of the underlying principle that sets these weights for the dynamics representation problem. We found that basing the combination on each motor plan's reliability best explains the results. Here, ‘reliability’ refers to execution reliability, not the sensory reliability common in cue integration theory. We have added further details explaining this in the manuscript.

      “We hypothesize that this inconsistency in results can be explained using a framework inspired by inverse optimal control. In this framework, the motor system can switch between or combine different solutions. That is, the motor system assigns different weights to each solution and calculates a weighted sum of these solutions. Usually, to support such a framework, previous studies found the weights by fitting the weighted-sum solution to behavioral data (Berret, Chiovetto et al. 2011). While we treat the problem in the same manner, we propose the Reliable Dynamics Representation (Re-Dyn) mechanism, which determines the weights instead of fitting them. According to our framework, the weights are calculated by considering the reliability of each representation during dynamic generalization. That is, the motor system prefers certain representations if the execution of forces based on those representations is more robust to distortion arising from neural noise. In this process, the motor system estimates the difference between the desired generalized forces and the generated generalized forces while taking into account noise added to the state variables that equivalently define the forces.”
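      The weighted-sum step in the quoted passage can be sketched as below. This is not the Re-Dyn implementation: the code assumes that an execution-error variance has already been estimated for each representation and movement direction, and it converts these into weights with an inverse-variance rule chosen purely for illustration (a cue-integration-style rule; the Re-Dyn mechanism defines its own mapping from reliability to weights). The function name and all numbers are hypothetical. The sketch shows how reliability-based weights can differ from one movement direction to the next, unlike a single set of fitted global weights.

```python
import numpy as np

def combine_representations(predicted_forces, error_variances):
    """Direction-wise weighted sum of the forces predicted by each representation.

    predicted_forces: (n_representations, n_directions) forces that each
        representation (e.g. Cartesian, joint, object) predicts in the test workspace.
    error_variances: same shape; hypothetical estimates of how far the generated
        force strays from the desired force when noise perturbs that
        representation's state variables (smaller = more reliable).
    Returns the combined force per direction and the weights used.
    """
    f = np.asarray(predicted_forces, dtype=float)
    v = np.asarray(error_variances, dtype=float)
    # ASSUMPTION (illustrative only): weight proportional to 1 / error variance,
    # normalized so that the weights in each movement direction sum to 1.
    w = 1.0 / v
    w = w / w.sum(axis=0, keepdims=True)
    return (w * f).sum(axis=0), w

# Hypothetical numbers: three representations, four movement directions.
forces = np.array([[1.0, 0.5, -0.5, -1.0],
                   [0.8, 0.9, -0.2, -0.6],
                   [-1.0, -0.5, 0.5, 1.0]])
variances = np.array([[0.20, 0.40, 0.40, 0.20],
                      [0.10, 0.10, 0.30, 0.30],
                      [0.50, 0.50, 0.10, 0.10]])
combined, weights = combine_representations(forces, variances)
print(np.round(combined, 3))
print(np.round(weights, 2))
```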

      A more rational model might be based on Bayesian decision theory. Under such a model, the motor system would select motor commands that minimize some expected loss, averaging over the various possible underlying 'true' coordinate systems in which to generalize. It's not entirely clear without developing the theory a bit exactly how the proposed noise-based theory might deviate from such a Bayesian model. But the paper should more clearly explain the principles/assumptions of the proposed noise-based model and should emphasize how the model parallels (or deviates from) Bayesian-decision-theory-type models.

      As we understand the reviewer's suggestion, the idea is to estimate the weight of each coordinate system by minimizing a loss function that considers the cost of each weight multiplied by a posterior probability representing the uncertainty in that weight's value. While this is an interesting idea, we believe that in the current problem there are no ‘true’ weight values. That is, because the environment is ambiguous, any combination of weights the motor system uses is equally valid. Since the force field was presented in only one area of the workspace, there is no observation that would allow prior beliefs about the forces in the rest of the environment to be updated. In such a case, the prior beliefs might play a role in the loss function, but in our opinion there is no clear rationale for choosing unequal priors other than guessing or fitting the prior probabilities, which would resemble previous models that relied on fitting rather than prediction.
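      The point about priors can be made explicit under one simple assumption, a squared-error loss. Let F_k(θ) be the force that coordinate system k (Cartesian, joint, or object) predicts for movement direction θ in an untrained area, and let p_k be the probability assigned to that system, with Σ_k p_k = 1. Because no forces are observed in the untrained area, the posterior equals the prior. The Bayes-optimal expected force u*(θ) is then:

```latex
\begin{aligned}
u^*(\theta) &= \arg\min_{u} \sum_{k} p_k \bigl(u - F_k(\theta)\bigr)^2, \\
0 &= \frac{\partial}{\partial u} \sum_{k} p_k \bigl(u - F_k(\theta)\bigr)^2
   = 2 \sum_{k} p_k \bigl(u - F_k(\theta)\bigr)
\quad\Longrightarrow\quad
u^*(\theta) = \frac{\sum_k p_k F_k(\theta)}{\sum_k p_k} = \sum_{k} p_k F_k(\theta).
\end{aligned}
```

Under these assumptions, the Bayesian solution is a weighted sum whose weights are exactly the prior probabilities and do not vary with movement direction; with equal priors it reduces to uniform weighting. Any other weighting would have to come from unequal priors or a different loss function.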

      Another significant weakness is that it is not clear how closely the weighting of the different coordinate frames needs to match the model predictions in order to recover the observed generalization patterns. Given that the weighting for a given movement direction is over-parametrized (i.e., there are 3 variable weights, allowing for decay, predicting a single observed force level), it seems that a broad range of models could generate a reasonable prediction. It would be helpful to compare the predictions using the weighting suggested by the model with the predictions using alternative weightings, e.g., a uniform weighting, or the weighting for a different posture. In fact, Fig. 7 shows that uniform weighting accounts for the data just as well as the noise-based model, in which the weighting varies substantially across directions. A more comprehensive analysis comparing the proposed noise-based weightings with alternative weightings would help argue more convincingly that the specific noise-based predictions are necessary. The analysis in the appendix was not clearly described; it seemed to compare various potential fitted mixtures of coordinate frames but did not compare these with the noise-based model predictions.

      We agree with the reviewer that fitted global weights, that is, an optimal weighted average of the three coordinate systems, should outperform most models that are based on prediction rather than on fitting the data. As shown in Figure 7 of the submitted version of the manuscript, we used the optimal fitted model to show that our noise-based model is indeed not optimal but can predict the behavioral results without falling far short of a fitted model. When fitting a model across all the reported experiments, we found a set of values that gives equal weights to the joint and object coordinate systems (0.27 for both) and a lower weight to the Cartesian coordinate system (0.12). Considering these values, we see why the reviewer might suggest a model based on equal weights across all coordinate systems. While such a model will not perform as well as the fitted model, it can still generate satisfactory results.
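      Fitting global weights, as described above, amounts to a linear least-squares problem: one weight per coordinate system, shared across all movement directions and not constrained to sum to 1, so that any decay of generalization is absorbed into the weights. The sketch below is a hypothetical illustration with synthetic patterns, not the fitting procedure used in the study; the true weights in the example are the fitted values quoted above (0.12, 0.27, 0.27). It also returns the numerical rank of the design matrix: if the three predicted patterns are linearly dependent, the weights are not uniquely determined, which relates to the reviewer's concern about over-parametrization.

```python
import numpy as np

def fit_global_weights(pattern_matrix, observed):
    """Least-squares weights shared across all movement directions.

    pattern_matrix: (n_directions, n_systems) array; column k is the force
        compensation predicted by coordinate system k in the test area.
    observed: (n_directions,) measured force compensation in the test area.
    Returns (weights, rank). If rank < n_systems, the predicted patterns are
    linearly dependent and the weights are not unique (lstsq then returns the
    minimum-norm solution).
    """
    X = np.asarray(pattern_matrix, dtype=float)
    y = np.asarray(observed, dtype=float)
    weights, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
    return weights, int(rank)

# Synthetic example: 8 movement directions, 3 arbitrary (independent) patterns.
rng = np.random.default_rng(2)
X = rng.normal(size=(8, 3))            # columns: Cartesian, joint, object (hypothetical)
true_w = np.array([0.12, 0.27, 0.27])  # global weights quoted in the response
w_hat, rank = fit_global_weights(X, X @ true_w)
print(np.round(w_hat, 3), rank)

# Same-frequency sinusoids that differ only in phase span just two dimensions,
# so three such patterns cannot pin down three weights.
theta = np.deg2rad(np.arange(0, 360, 45))
S = np.column_stack([np.sin(theta), np.sin(theta - 0.6), np.sin(theta + 1.0)])
print(fit_global_weights(S, S @ true_w)[1])
```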

      To better understand whether a model based on global weights can explain the combination of coordinate systems, we performed an additional experiment. In this experiment, a model based on globally fitted weights can predict only one of two possible generalization patterns, while models based on direction-specific predicted weights can predict a variety of generalization patterns. We show that global weights, although fitted to the data, cannot explain participants' behavior. We report these new results in Appendix 2.

      “To better understand if a model based on global weights can explain the combination between coordinate systems, we perform an additional experiment. We used the idea of experiment 3, in which participants generalize learned dynamics using a tool; that is, the arm posture does not change between the training and test areas. In such a case, the Cartesian and joint coordinate systems do not predict a shift in the generalized force pattern, while the object coordinate system predicts a shift that depends on the orientation of the tool. In this additional experiment, we set a test workspace in which the orientation of the tool is 90° (Appendix 2-figure 1A). In this case, for the test workspace, the force compensation pattern of the object-based coordinate system is in anti-phase with the Cartesian/joint generalization pattern. Any globally fitted weights (including equal weights) can therefore produce only a non-shifted or a 90°-shifted force compensation pattern (Appendix 2-figure 1B). Participants in this experiment (n=7) showed a similar MPE reduction to that in all previous experiments when adapting to the trigonometrically scaled force field (Appendix 2-figure 1C). When examining the generalized force compensation patterns, we observed a shift of the pattern in the test workspace of 14.6° (Appendix 2-figure 1D). This cannot be explained by the individual coordinate system force compensation patterns or any combination of them (which will always predict either a 0° or 90° shift, Appendix 2-figure 1E). However, calculating the prediction of the Re-Dyn model, we found a predicted force compensation pattern with a shift of 6.4° (Appendix 2-figure 1F). The intermediate shift in the force compensation pattern suggests that no set of global weights can explain the results.”
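      The claim that global weights can only yield a 0° or 90° shift can be checked numerically. In this sketch the force compensation pattern is assumed, purely for illustration, to be a 180°-periodic sinusoid sin(2(θ - δ)), so that a 90° shift of the object-based pattern is exactly anti-phase with the unshifted Cartesian/joint patterns. The direction grid, the weight grid, and the helper names are hypothetical. The shift of each mixture is read off the phase of its second-harmonic component.

```python
import numpy as np

# 16 movement directions at 22.5 deg spacing (assumed layout).
THETA = np.deg2rad(np.arange(0.0, 360.0, 22.5))


def pattern(shift_deg):
    """Hypothetical 180-deg-periodic pattern sin(2*(theta - shift))."""
    return np.sin(2.0 * (THETA - np.deg2rad(shift_deg)))


def fitted_shift(f):
    """Amplitude and shift (deg, wrapped to [0, 180)) of the second-harmonic
    component of f, from its projections onto sin(2*theta) and cos(2*theta)."""
    s = 2.0 / THETA.size * np.sum(f * np.sin(2.0 * THETA))
    c = 2.0 / THETA.size * np.sum(f * np.cos(2.0 * THETA))
    amp = float(np.hypot(s, c))
    shift = -np.rad2deg(np.arctan2(c, s)) / 2.0
    return amp, float(round(shift, 6) % 180.0)


# Test workspace: Cartesian and joint patterns unshifted, object shifted 90 deg.
F_CART, F_JOINT, F_OBJ = pattern(0.0), pattern(0.0), pattern(90.0)


def global_mixture_shift(w_c, w_j, w_o):
    return fitted_shift(w_c * F_CART + w_j * F_JOINT + w_o * F_OBJ)


grid = np.linspace(0.0, 1.0, 11)
shifts = set()
for w_c in grid:
    for w_j in grid:
        for w_o in grid:
            amp, shift = global_mixture_shift(w_c, w_j, w_o)
            if amp > 1e-9:  # skip mixtures whose patterns cancel to zero
                shifts.add(round(shift, 3))
print(sorted(shifts))  # -> [0.0, 90.0]
```

      Because the object pattern is the negative of the Cartesian/joint pattern in this toy setup, every mixture reduces to (w_c + w_j - w_o) sin(2θ); its sign alone decides between 0° and 90°, so an intermediate shift such as the observed 14.6° cannot be produced by any global weighting.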

      With regard to the suggestion that the weighting changes according to arm posture, two of our results argue against the possibility that posture governs the weights:

      (1) In experiment 3, we tested generalization while keeping the same arm posture between the training and test workspaces, and we observed different force compensation profiles across the movement directions. If arm posture in the test workspaces determined the weights, we would expect identical weights for both test workspaces. However, any set of weights that can explain the results observed for workspace 1 fails to explain the results observed for workspace 2. To better understand this point, we calculated the global weights for each test workspace in this experiment and observed an increase in the weight of the object coordinate system (0.41 vs. 0.5) and a reduction in the weights of the Cartesian and joint coordinate systems (0.29 vs. 0.24). This suggests that arm posture cannot explain the generalization pattern in this case.

      (2) In experiments 2 and 3, we used the same arm posture in the training workspace and either changed the arm posture (experiment 2) or did not change the arm posture (experiment 3) in the test workspaces. While the arm posture for the training workspace was the same, the force generalization patterns were different between the two experiments, suggesting that the arm posture during the training phase (adaptation) does not set the generalization weights.

      Overall, this shows that it is not specifically the arm posture in either the test or the training workspace that sets the weights. Of course, all coordinate models, including our noise model, take posture into account when determining the weights.

      Reviewer #2 (Public Review):

      Leib & Franklin assessed how the adaptation of intersegmental dynamics of the arm generalizes to changes in different factors: areas of extrinsic space, limb configurations, and 'object-based' coordinates. Participants reached in many different directions around 360°, adapting to velocity-dependent curl fields that varied depending on the reach angle. This learning was measured via the pattern of forces expressed against the channel wall of "error clamps" that were randomly sampled from each of these different directions. The authors employed a clever method to predict how this pattern of forces should change if the set of targets was moved around the workspace. Some sets of locations resulted in a large change in joint angles or object-based coordinates, but Cartesian coordinates were always the same. Across three separate experiments, the observed shifts in the generalized force pattern never corresponded to a change that was made relative to any one reference frame. Instead, the authors found that the observed pattern of forces could be explained by a weighted combination of the change in Cartesian, joint, and object-based coordinates across test and training contexts.

      In general, I believe the authors make a good argument for this specific mixed weighting of different contexts. I have a few questions that I hope are easily addressed.

      Movements show different biases relative to the reach direction. Although very similar across people, this function of biases shifts when the arm is moved around the workspace (Ghilardi, Gordon, and Ghez, 1995). These biases are thought to arise from several factors that would change across the different test and training workspaces employed here (Vindras & Viviani, 2005). My concern is that the baseline biases differ across these contexts, and that the observed change in the force pattern across contexts reflects a change in underlying biases rather than generalization. Baseline force channel measurements were taken in the different workspace locations and conditions, so these could be used to show whether such biases are meaningfully affecting the results.

      We agree with the reviewer and we followed their suggested analysis. In the following figure (Author response image 1) we plotted the baseline force compensation profiles in each workspace for each of the four experiments. As can be seen in this figure, the baseline force compensation is very close to zero and differs significantly from the force compensation profiles after adaptation to the scaled force field.

      Author response image 1.

      Baseline force compensation levels for experiments 1-4. For each experiment, we plotted the force compensation for the training, test 1, and test 2 workspaces.

      Experiment 3, Test 1 has data that seems the worst fit with the overall story. I thought this might be an issue, but this is also the test set for a potentially awkwardly long arm. My understanding of the object-based coordinate system is that it's primarily a function of the wrist angle, or perceived angle, so I am a little confused why the length of this stick is also different across the conditions instead of just a different angle. Could the length be why this data looks a little odd?

      Usually, force generalization is tested by physically moving the hand into unexplored areas. In experiment 3, we tested generalization using a tool, which, as far as we know, has not previously been tested in a way similar to the present experiment. Indeed, the results look odd compared with those of the other experiments, which were based on the ‘classic’ generalization paradigm. While we have some ideas regarding possible reasons for the observed behavior, they are beyond the scope of the current work and need further examination.

      Based on the reviewer’s comment, we improved the explanation in the introduction of the idea behind the object-based coordinate system:

      “we could represent the forces as belonging to the hand or a hand-held object using the orientation vector connecting the shoulder and the object or hand in space (Berniker, Franklin et al. 2014).” The reviewer is right in their observation that the predictions of the object-based reference frame will look the same if we change the length of the tool. The object-based generalized forces, specifically the shift in the force pattern, depend only on the object's orientation, not on its length (equation 4).

      The manuscript is written and organized in a way that focuses heavily on the noise element of the model. Other than it being reasonable to add noise to a model, it's not clear to me that the noise is adding anything specific. It seems like the model makes predictions based on how many specific components have been rotated in the different test conditions. I fear I'm just being dense, but it would be helpful to clarify whether the noise itself (and inverse variance estimation) are critical to why the model weights each reference frame how it does or whether this is just a method for scaling the weight by how much the joints or whatever have changed. It seems clear that this noise model is better than weighting by energy and smoothness.

      We have now included further details of the noise model and added to Figure 1 to highlight how noise can affect the predicted weights. In short, we agree with the reviewer that there are multiple ways to add noise to the generalized force patterns. We chose a simple option in which we simulate possible distortions of the state variables that set the direction of movement. Once the variance of each force profile due to this distortion has been calculated, one possible way to combine the profiles is an inverse-variance estimator. Note that an inverse-variance estimator has been shown to be an ideal way to combine signals (e.g., Shahar, D.J. (2017) https://doi.org/10.4236/ojs.2017.72017). However, we do not claim, or try to provide evidence for, this specific way of calculating the weights. Instead, we suggest that giving greater weight to the less variable force representation can predict both the current experimental results and past results.
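      A minimal sketch of the idea described above, assuming each coordinate system supplies a predicted force as a function of movement direction θ: the movement direction is perturbed with Gaussian noise, the resulting variance of each prediction is estimated by Monte Carlo sampling, and the predictions are combined with inverse-variance weights. The three prediction functions, the noise level `sigma`, and all function names are hypothetical and do not reproduce the manuscript's model.

```python
import numpy as np


def inverse_variance_weights(variances):
    """Weights proportional to 1/variance, normalised to sum to one."""
    inv = 1.0 / np.asarray(variances, dtype=float)
    return inv / inv.sum()


def direction_noise_variance(force_fn, theta, sigma, n=20000, seed=0):
    """Monte Carlo variance of force_fn when the movement direction theta
    (rad) is perturbed by zero-mean Gaussian noise with s.d. sigma."""
    rng = np.random.default_rng(seed)
    return float(np.var(force_fn(theta + rng.normal(0.0, sigma, size=n))))


# Hypothetical predicted force profiles, one per coordinate system.
PREDICTIONS = {
    "cartesian": lambda th: np.sin(2.0 * th),
    "joint": lambda th: np.sin(2.0 * (th - np.pi / 8)),
    "object": lambda th: np.sin(2.0 * (th - np.pi / 4)),
}


def combine(theta, sigma=0.1):
    """Return the inverse-variance weights and the combined force at theta."""
    fns = list(PREDICTIONS.values())
    variances = [direction_noise_variance(f, theta, sigma) for f in fns]
    weights = inverse_variance_weights(variances)
    force = float(np.dot(weights, [f(theta) for f in fns]))
    return dict(zip(PREDICTIONS, map(float, weights))), force


weights, force = combine(0.0)
print({k: round(v, 3) for k, v in weights.items()}, round(force, 3))
```

      At θ = 0 in this toy setup, the object prediction sits at an extremum of its profile, so direction noise barely changes it and it receives the largest weight; the Cartesian prediction sits on the steepest part of its profile and receives the smallest.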

      Are there any force profiles for individual directions that are predicted to change shape substantially across some of these assorted changes in training and test locations (rather than merely being scaled)? If so, this might provide another test of the hypotheses.

      In experiments 1-3, in which there is a large shift of the force compensation curve, we found directions in which the generalized force was flipped in direction. That is, clockwise force profiles in the training workspace could change into counterclockwise profiles in the test workspace. For example, in experiment 2, for movement at 157.5°, the force profile was clockwise in the training workspace (force compensation value of 0.43), whereas movement in the same direction was counterclockwise in test workspace 1 (force compensation value of -0.48). Importantly, the noise-based model could predict this change.

      Author response image 2.

      Results of experiment 2. Force compensation profiles for the training workspace (grey solid line) and test workspace 1 (dark blue solid line). Examining the forces in the 157.5° direction, we found a change in the force applied by the participants (from clockwise to counterclockwise), reflected in a change in the force compensation value (0.43 vs. -0.48). The noise-based model can predict this change, as shown by the predicted force compensation profile (green dashed line).

      I don't believe the decay factor that was used to scale the test functions was specified in the text, although I may have just missed this. It would be a good idea to state what this factor is where relevant in the text.

      We added an equation describing the decay factor (new equation 7 in the Methods section) in response to this suggestion and to Reviewer 1's comment on the same issue.

      Reviewer #3 (Public Review):

      The authors proposed the minimum variance principle for the memory representation, in addition to two alternative theories: minimum energy and maximum smoothness. The strength of this paper is the match between the predictions computed from the explicit equation and the behavioral data taken in different conditions. The idea of the weighting of multiple coordinate systems is novel and is also able to reconcile a debate in previous literature.

      The weakness is that although each model is based on an optimization principle, the derivation process is not written in the Methods section. The authors did not describe how they derive these weighting factors from these computational principles. Thus, it is not clear whether these weighting factors follow from these theories or are just ad hoc methods. If the authors argue that this is the result of the minimum variance principle, they should show how these weighting factors are derived as the result of an optimization process that minimizes these cost functions.

      The reviewer brings up a very important point regarding the model. As shown below, it is not trivial to derive these weights using an analytical optimization process. We demonstrate one issue with this optimization process.

      The force representation can be written as (similar to equation 6):

      We formulated the problem as minimizing the variance of the force with respect to the weights w:

      In this case, the variance of the force is a variance-covariance matrix, which can be minimized by minimizing its trace:

      We will start by calculating the variance of the force representation in the joint coordinate system:

      Here, the force variance results from a complex function that includes the joint angles as random variables. Expanding the last expression, although very complex, is still possible. In the resulting expression, some of the terms require calculating the variance of nested trigonometric functions of the random joint angles, for example:

      In the vast majority of these cases, analytical solutions do not exist. Similar issues arise when calculating the variance of products of trigonometric functions, such as in the multiplication of Jacobians (and inverse Jacobians).

      To overcome this problem, we turned to numerical solutions which simulate the variance due to the different state variables.
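      As an illustration of this numerical route, the sketch below estimates the trace of the force variance-covariance matrix for a joint-based representation of a planar two-link arm: a stored joint torque τ is mapped to an endpoint force through F = J(q)^-T τ while the joint angles q are perturbed with Gaussian noise. The link lengths, posture, torque, noise levels, and function names are assumptions made for the example, not values from the study.

```python
import numpy as np

L1, L2 = 0.30, 0.33  # hypothetical upper-arm and forearm lengths (m)


def jacobian(q):
    """Endpoint Jacobian of a planar two-link arm; q = (shoulder, elbow) in rad."""
    q1, q2 = q
    s1, c1 = np.sin(q1), np.cos(q1)
    s12, c12 = np.sin(q1 + q2), np.cos(q1 + q2)
    return np.array([[-L1 * s1 - L2 * s12, -L2 * s12],
                     [L1 * c1 + L2 * c12, L2 * c12]])


def hand_force_from_torque(tau, q):
    """Map a joint torque to an endpoint force, F = J(q)^-T tau."""
    return np.linalg.solve(jacobian(q).T, tau)


def force_cov_trace(tau, q, sigma_q, n=5000, seed=0):
    """Monte Carlo trace of Cov(F) under Gaussian joint-angle noise (rad)."""
    rng = np.random.default_rng(seed)
    qs = np.asarray(q, dtype=float) + rng.normal(0.0, sigma_q, size=(n, 2))
    forces = np.array([hand_force_from_torque(tau, qi) for qi in qs])
    return float(np.trace(np.cov(forces, rowvar=False)))


# Hypothetical posture and a torque that yields a 5 N force along +y there.
q0 = np.array([np.pi / 4, np.pi / 2])
tau0 = jacobian(q0).T @ np.array([0.0, 5.0])
for sigma in (0.01, 0.05):
    print(f"sigma_q = {sigma:.2f} rad -> trace(Cov F) = "
          f"{force_cov_trace(tau0, q0, sigma):.4f} N^2")
```

      Sampling sidesteps the nested trigonometric terms entirely: the inverse-transpose Jacobian is evaluated at each sampled posture, and the trace is taken of the empirical covariance.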

      In addition, I am concerned that the proposed model can cancel the property of the coordinate system by the predicted variance, and it can work for any coordinate system, even one that is not used in the human brain. When the applied force is given in Cartesian coordinates, the directionality in the generalization ability of the memory of the force field is characterized by the kinematic relationship (Jacobian) between the Cartesian coordinate and the coordinate of interest (Cartesian, joint, and object) as shown in Equation 3. At the same time, when a displacement (epsilon) is considered in a space and a corresponding displacement is linked with kinematic equations (e.g., joint displacement and hand displacement in the two-joint arm in this paper), the generated variances in different coordinate systems are linked to each other by the kinematic equation (Jacobian). Thus, how a small noise in a certain coordinate system generates the hand force noise (sigma_x, sigma_j, sigma_o) is also characterized by the kinematics (Jacobian). Thus, when the predicted force field (F_c, F_j, F_o) is divided by the variance (F_c/sigma_c^2, F_j/sigma_j^2, F_o/sigma_o^2), the directionality of the generalization force, which is characterized by the Jacobian, is canceled by the directionality of the sigmas, which is also characterized by the Jacobian. Thus, as can be read from Fig*D and E top, the weight in E-top of each coordinate system is always the inverse of the shift of force from the test force, by which the directionality of the generalization is always canceled.

      Once this directionality is canceled, no matter how the weighted sum is computed, it can replicate the memorized force. Thus, this model always works to replicate the test force no matter which coordinate system is assumed, and I am suspicious of the falsifiability of this computational model. Even if the authors used, for instance, the robot coordinate system, which is directly linked to the participant's hand through the kinematic equation (Jacobian), they could replicate this result. But in this case, the model would be nonsense. The falsifiability of this model was not explicitly addressed.

      As explained above, the variability of the generalized forces given the random nature of the state variables is a complex function that cannot be summarized by a Jacobian. Importantly, the model is unable to reproduce the test force arbitrarily. In fact, we have already shown this (see Appendix 1-figure 1): when we attempt to explain the data with only a single coordinate system (or a combination of two coordinate systems), we are completely unable to replicate the test data despite using this model. For example, in experiment 4, when we don’t use the joint-based coordinate system, the model predicts zero shift of the force compensation pattern, while the behavioral data show a shift due to the contribution of the joint coordinate system. Any arbitrary model (similar to the random model we tested; please see the response to Reviewer 1) would be completely unable to recreate the test data. Instead, our model makes very specific predictions about the weighting between the three coordinate systems and therefore fully specified force predictions for every possible test posture. We added this point to the Discussion:

      “The results we present here support the idea that the motor system can use multiple representations during adaptation to novel dynamics. Specifically, we suggest that we combine three types of coordinate systems, each independent of the others (see Appendix 1-figure 1 for comparison with other combinations). Other combinations that include only one or two coordinate systems can explain some of the results but not all of them, suggesting that force representation relies on all three, with specific weights that change between generalization scenarios.”

    1. Author Response:

      Reviewer #1 (Public Review):

      Garcia-Souto, Bruzos, and Diaz et al. analyzed hemic neoplasia in warty venus clams at multiple sites throughout Europe. They identified cases of disease in two locations, in Galicia and in the Mediterranean. They then use Illumina sequencing to discover that the samples with cancer DNA had reads which mapped to the mtDNA reference sequences from a different clam species in the same family, suggesting a cross-species transmissible cancer. By mapping reads to both the V. verrucosa and C. gallina mitogenomes they showed that more reads mapped to C. gallina in cancer samples compared to matched host tissue samples, and this was consistent across the whole mitogenome. Phylogenetic analysis of mtDNA genes of the host and cancer samples as well as identification of SNVs at a short region of one single-copy nuclear locus suggest that all cancer samples come from a single C. gallina transmissible cancer clone. All data agree that a single lineage of cancer from C. gallina is responsible for all identified cancers in V. verrucosa.

      There are a few sections where the methods are either unclear or do not quite match the descriptions of the results.

      1. Regarding mapping of reads to different reference Cox1 sequences (for Figure 2a): "Then, we mapped the paired-end reads onto a dataset containing non-redundant mitochondrial Cytochrome C Oxidase subunit 1 (Cox1) gene references from 137 venerid clam species." I do not see where this is explained in the methods, where this list of references comes from, or what it contains.

      Answer: We retrieved a dataset of 3,745 sequences comprising all the barcode-identified venerid clam Cox1 fragments available from the Barcode of Life Data System (BOLD, http://www.boldsystems.org/). Redundancy was removed using CD-HIT (Fu, et al. 2012), applying a cut-off of 0.9 sequence identity, and sequences were trimmed to cover the same region. Whole-genome sequencing data from both healthy and tumoral warty venus clams were mapped onto this dataset, which contains 118 species-unique venerid sequences, using BWA-mem; reads with mapping quality below 60 were filtered out (-q60), and the overall coverage for each sequence was quantified with samtools idxstats. PCR primers were designed with Primer3 v2.3.7 (Koressaar, et al. 2018) to amplify a fragment of 354 bp from the Cox1 mitochondrial gene of V. verrucosa and C. gallina (F: CCT ATA ATA ATT GGK GGA TTT GG, R: CCT ATA ATA ATT GGK GGA TTT GG). PCR products were purified with ExoSAP-IT and sequenced by Sanger sequencing.
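      The coverage step can be illustrated with a short parser for `samtools idxstats` output, whose tab-separated columns are reference name, sequence length, mapped read-segments, and unmapped read-segments (the final `*` line counts reads without a reference). Normalizing counts by reference length and ranking the references is an assumption made for illustration, not necessarily how the coverage was summarized in the study; the reference names and counts in the usage example are invented.

```python
def parse_idxstats(text):
    """Parse `samtools idxstats` output into (name, length, mapped) tuples.

    Columns are tab-separated: reference name, sequence length, mapped
    read-segments, unmapped read-segments. The final '*' line (reads with no
    reference) is skipped."""
    rows = []
    for line in text.strip().splitlines():
        name, length, mapped, _unmapped = line.split("\t")
        if name == "*":
            continue
        rows.append((name, int(length), int(mapped)))
    return rows


def rank_references(text, top=3):
    """Rank references by mapped reads per kilobase of reference length."""
    scored = [(name, 1000.0 * mapped / length)
              for name, length, mapped in parse_idxstats(text) if length > 0]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top]


# Invented counts for three hypothetical references (not study data).
EXAMPLE = (
    "species_A_Cox1\t600\t120\t0\n"
    "species_B_Cox1\t600\t4800\t0\n"
    "species_C_Cox1\t600\t15\t0\n"
    "*\t0\t0\t950\n"
)
print(rank_references(EXAMPLE, top=2))
# -> [('species_B_Cox1', 8000.0), ('species_A_Cox1', 200.0)]
```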

      Action: We have included this new information in the methods section.

      1. Regarding de novo assembly of mitogenomes: "Hence, we employed bioinformatic tools to reconstruct the full mitochondrial DNA (mtDNA) genomes in representative animals from the two species involved....Then, we mapped the paired-end sequencing data from the six neoplastic specimens with evidence of interspecies cancer transmission onto the two reconstructed species-specific mtDNA genomes." In contrast to this, the methods say, "Then, we run MITObim v1.9.1 (Hahn, Bachmann, & Chevreux, 2013) to assemble the full mitochondrial genome of all sequenced samples, using gene baits from the following Cox1 and 16S reference genes to prime the assembly of clam mitochondrial genomes." It is unclear which method was used.

      Answer: In total, we performed whole-genome sequencing on 23 samples from 16 clam specimens, including eight neoplastic and eight non-neoplastic animals, using Illumina paired-end libraries of 350 bp insert size and 150 bp reads. First, we assembled the mitochondrial genomes of one V. verrucosa (FGVV18_193), one C. gallina (ECCG15_201) and one C. striatula (EVCS14_02) specimen with MITObim v1.9.1 (Hahn, et al. 2013), using gene baits from the following Cox1 and 16S reference genes to prime the assembly of clam mitochondrial genomes: V. verrucosa (Cox1, with GenBank accession number KC429139; and 16S: C429301), C. gallina (Cox1: KY547757, 16S: KY547777) and C. striatula (Cox1: KY547747, 16S: KY547767). These draft sequences were polished twice with Pilon v1.23 (Walker, et al. 2014), and conflicting repetitive fragments from the mitochondrial control region were resolved using long-read sequencing with Oxford Nanopore Technologies (ONT) on a set of representative samples from each species and from tumours. ONT reads were assembled with Miniasm v0.3 (Li 2016) and corrected using Racon v1.3.1 (Vaser, et al. 2017). Protein-coding genes, rDNAs and tRNAs were annotated on the curated mitochondrial genomes using the MITOS2 web server (Bernt, et al. 2013), and manually curated to fit ORFs as predicted by ORF-FINDER (Rombel, et al. 2002). Then, we employed the entire mitochondrial DNAs of V. verrucosa (FGVV18_193) and C. gallina (ECCG15_201) as “references” to map reads from individuals with neoplasia, filter reads matching either mitogenome, and assemble and polish their two (healthy and tumoral) mitogenomes individually as above. Further healthy individuals were later sequenced and their mitogenomes assembled to investigate the geographic and taxonomic spread of this neoplasia.

      Action: We have included this information in the methods section (pages 21-22) and in the results (pages 7 and 8). mtDNA annotations are now shown in Supplementary Figure 3. Nucleotide data for the mitochondrial DNA assemblies have been uploaded to GenBank under accession numbers MW662590-MW662611 and will be released upon publication or request.

      There is one minor claim which may not be fully supported by the data: the statement that, "The analysis of mitochondrial and nuclear gene sequences revealed no nucleotide divergence between the seven tumours sequenced." If I am understanding the filtering of the SNVs from the nuclear gene correctly, only the presence or absence of the 14 SNVs that were fixed within each of the two species were analyzed. Therefore, it is unclear whether the authors looked for any additional somatic mutations within the cancer lineage that would have occurred at other positions. For mitochondria, the authors state that sequences were "extracted from paired-end sequencing data," but it is not explained how this was done. The data suggest that there are no differences between cancer samples in the 13 coding genes and 2 rDNA genes, but data on possible SNVs in the intergenic regions is not shown.

      Answer: We obtained a preliminary nuclear assembly using short reads only. As expected, the resulting assemblies are fragmented and incomplete, which limited the identification of candidate regions shared by the three genomes (V. verrucosa and both Chamelea clams). Of the 44 candidate nuclear fragments we tested, only two (DEAH12 and TFIIH) gave good PCR products, adequate for Sanger sequencing. As mentioned above, we now provide additional data on a second gene (TFIIH), identified and selected on the same basis as DEAH12. We found 14 and 15 sites, respectively, in the DEAH12 and TFIIH loci with fixed SNVs (allele frequency >95%) that allowed us to discriminate between the three relevant species (V. verrucosa, C. gallina and C. striatula) and the tumour. These diagnostic nucleotides were then used to filter the reads from individuals with neoplasia harbouring both DNAs. Variation was found within the host lineage, but not within the tumour, along the nuclear DNA fragments employed in the ML phylogenies (see figure below).

      Figure. Molecular phylogenies based on the two selected nuclear markers. (a) DEAH12 gene and (b) TFIIH gene, and diagnostic loci discriminating among species and tumour. Bootstrap support values above 50 from ML analyses (500 replicates) are shown above the corresponding branches. Note that all diagnostic nucleotides are identical between tumours (black dots).
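      The selection of diagnostic nucleotides described above can be sketched as follows, under stated assumptions: per-group pileups are given as a mapping from position to observed bases, a site counts as fixed when its major allele exceeds the 95% frequency threshold, and a site is diagnostic when every group is fixed there and at least two groups carry different alleles. Reads are then assigned to the groups whose diagnostic alleles they match. The data layout, group labels, and function names are hypothetical, and the toy pileups are invented.

```python
from collections import Counter


def fixed_allele(bases, threshold=0.95):
    """Return the major allele if its frequency exceeds `threshold`, else None."""
    counts = Counter(b for b in bases.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return None
    allele, n = counts.most_common(1)[0]
    return allele if n / total > threshold else None


def diagnostic_sites(pileups, threshold=0.95):
    """pileups: {group: {position: observed bases}}.

    Keep positions covered in every group where each group is fixed and at
    least two groups carry different fixed alleles."""
    groups = list(pileups)
    shared = set.intersection(*(set(pileups[g]) for g in groups))
    sites = {}
    for pos in sorted(shared):
        alleles = {g: fixed_allele(pileups[g][pos], threshold) for g in groups}
        if None not in alleles.values() and len(set(alleles.values())) > 1:
            sites[pos] = alleles
    return sites


def assign_read(read, sites):
    """read: {position: base}. Return the groups whose diagnostic alleles
    match the read at every diagnostic position it covers."""
    covered = [pos for pos in read if pos in sites]
    if not covered:
        return []
    groups = next(iter(sites.values()))
    return sorted(g for g in groups
                  if all(sites[pos][g] == read[pos] for pos in covered))


# Invented toy pileups for three groups.
PILEUPS = {
    "V_verrucosa": {101: "A" * 20, 202: "C" * 20, 303: "T" * 19 + "C"},
    "C_gallina": {101: "G" * 20, 202: "C" * 20, 303: "T" * 20},
    "tumour": {101: "G" * 20, 202: "C" * 20, 303: "T" * 20},
}
sites = diagnostic_sites(PILEUPS)
print(sorted(sites))                             # -> [101]
print(assign_read({101: "G", 202: "C"}, sites))  # -> ['C_gallina', 'tumour']
```

      Position 303 is excluded because V_verrucosa reaches exactly 95%, which does not exceed the threshold; position 202 is excluded because all groups share the same allele.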

      Regarding the mtDNA, we first assembled the mitochondrial genomes of one V. verrucosa (FGVV18_193), one C. gallina (ECCG15_201) and one C. striatula (EVCS14_02) specimen with MITObim v1.9.1 (Hahn, et al. 2013). Then, we employed the entire mitochondrial DNAs from V. verrucosa (FGVV18_193) and C. gallina (ECCG15_201) as “references” to map reads from individuals with neoplasia, filter reads matching either mitogenome, and assemble and polish their two (healthy and tumoral) mitogenomes individually as above. Further healthy individuals were later sequenced and their mitogenomes assembled to investigate the geographic and taxonomic spread of this neoplasia. Despite the usefulness of the mitochondrial control region (CR) for detecting differences among lineages, we refrained from using it for two reasons. (1) The CR shows considerable variation in both length and sequence among the three species, making their alignment difficult (in fact, previous phylogenetic studies based on whole mitochondrial DNA sequences in Veneridae excluded the CR: https://doi.org/10.1111/zsc.12454), and (2) the CR contains nearly but not exactly identical tandem repeats, as in other mollusks (e.g., the venerid Dosinia clams https://doi.org/10.1371/journal.pone.0196466 or the Littorina marine snails https://doi.org/10.1016/j.margen.2016.10.006). In our case, the repeats are longer than the short-read insert size, and even though we could infer them by means of long-read sequencing, polishing the resulting consensus sequences to overcome the intrinsic error rate of those reads would yield inconclusive results, hindering the comparison between normal and tumoral haplotypes.

      Action: We updated the methods for the mitochondrial DNA analyses (pages 21-22, 24) and the nuclear DNA analyses (page 23). We now include new data in the results and discussion (pages 9-10).

      Reviewer #2 (Public Review):

      In rare but well-documented instances, certain types of cancers can transmit horizontally. These transmissible cancers have a clonal origin and have adapted to bypass allorecognition. A form of marine leukemia (hemic neoplasia or HM) belongs to this class of transmissible cancers and has been detected in several bivalve species (oysters, mussels, cockles and clams). Although HM mostly propagates within the same bivalve species, instances of cross-species transmission have been reported. To better understand the mode of transmission of HM, Garcia-Souto et al. analysed mitochondrial DNA (mtDNA) by next generation sequencing in different bivalve species collected in the Mediterranean Sea and the Atlantic Ocean. The authors found that HM isolated in Venus verrucosa contained mtDNA that actually matched Chamelea gallina. Analysis of the nuclear gene DEAH12 also showed single nucleotide polymorphisms (SNPs) matching C. gallina DNA. Based on mtDNA and DEAH12 sequences, the authors use Bayesian inference to generate phylogenetic trees showing that HM found in V. verrucosa is much closer to C. gallina than the host species. They conclude that HM propagated from C. gallina to V. verrucosa.

      Overall, the study is well performed with enough samples analysed. The results are quite convincing but there are also some concerns.

      1. Transmissible cancers are known to split into clades based on mtDNA differential rate of evolution and also to incorporate mtDNA from exogenous sources, so one has to be extra careful that the results prove cross-species transmission and not HM divergence into two clades and/or exogenous acquisition. Samples HM ERVV17-2997 and EMVV18-376, both at the N1 stage, appear devoid of C. gallinae mtDNA and do not appear to have been screened for DEAH12. One explanation for this result is that there are too few HM cells in the samples (but Supplementary Figure 1 shows some HM cells in ERVV17-2997). However, a different explanation is that these samples contain V. verrucosae mtDNA. ERVV17-2997 and EMVV18-376 could have been analysed in greater depth to verify that they also contained C. gallinae mtDNA and typical DEAH12 SNPs.

      Answer: Despite the high sequencing coverage obtained for the sequenced individuals, we did not find foreign reads in the N1 tumours (ERVV17-2997 and EMVV18-73) at either the mitochondrial or the nuclear (i.e., DEAH12, TFIIH) level. This is most likely due to a very low proportion of neoplastic cells in their tissues.

      Action: We have added a sentence on page 8 that discusses this issue.

      1. To strengthen their argument, the authors could have analysed a few more nuclear genes for specific SNPs, although the sensitivity of this approach will depend on the depth of sequencing.

      Answer: We obtained a preliminary nuclear assembly using short reads only. As expected, the resulting assemblies are fragmented and incomplete, which limited the identification of candidate regions shared by the three genomes (V. verrucosa and both Chamelea clams). Of the 44 candidate nuclear fragments we tested, only two (DEAH12 and TFIIH) gave good PCR products, adequate for Sanger sequencing. As mentioned above, we now provide additional data on a second gene (TFIIH), identified and selected on the same basis as DEAH12. Individual ML phylogenies for these two fragments showed that tumours cluster together and separately from the host species and, in the case of DEAH12, closer to C. gallina. The MSC phylogeny was rebuilt including this new nuclear fragment. In addition, we conducted a comparative screening of tandem repeats in the genomes of C. gallina and V. verrucosa. Two DNA satellites, namely CL4 and CL17, with monomer sizes of 332 and 429 bp, respectively, were very abundant in C. gallina and in the tumoral animals, but absent from all healthy V. verrucosa specimens. FISH probes designed for these satellites mapped to the heterochromatic regions, mainly in subcentromeric and subtelomeric positions, of both C. gallina and the neoplastic metaphases found in V. verrucosa, but were absent from the normal metaphases of the host species V. verrucosa. These results were consistent with the genomic abundance of these satellites in the NGS data and strongly suggest that these chromosomes derive from C. gallina.

      Action: We include the analysis of one additional nuclear locus, TFIIH (pages 9-10). We have obtained new ML and MSC phylogenies including this new locus (pages 9-10, figures 3b-c). An additional FISH analysis targeting the satellite DNAs CL4 and CL11 was performed (page 10, figure 3d, supplementary figure 5). The methods section has been updated accordingly (pages 20-21, 23-24).

      1. It would have been interesting to have more information in the Discussion on the potential immunological barriers that this tumour needs to overcome for cross-species transmission.

      Answer: We could argue that such transmissibility, whether within or across species, is prone to occur in bivalves because of their filter-feeding system and because their immune system, as the reviewer may know, is not entirely developed and is yet to be completely understood. It would also be tempting to suggest that genetic restrictions might limit cancer contagion to closely related taxa, but unfortunately our current data do not allow us to state this.

      Action: At this point, no specific action has been taken for this query. However, we are happy to include something in the discussion if the reviewer still thinks this is relevant for improving the manuscript.

    1. Author Response:

      Reviewer #1 (Public Review):

      Jo et al. use a combination of micropatterned differentiation, single cell RNA sequencing and pharmacological treatments to study primordial germ cell (PGC) differentiation starting from human pluripotent stem cells. Geometrical confinement in conjunction with a pre-differentiation step allowed the authors to reach remarkable differentiation efficiencies. While Minn et al. already reported the presence of PGC-like cells in micropatterned differentiating human cultures by scRNA-Seq (as acknowledged by the authors), the careful characterization of the PGC-like population using immunostainings and scRNA-Seq is a strength of the manuscript. The attempt at mechanistically dissecting the signaling pathways required for PGC fate specification is somewhat weaker. The authors do not present sufficient evidence supporting the ability to specify PGC fate in the absence of Wnt signaling and the importance of the relative signaling levels of BMP to Nodal pathways; the wording of the text should be amended to better reflect the presented evidence or the authors should perform additional experiments to support these claims.

      We thank the reviewer for this comment. As described in more detail in the responses below, we have significantly strengthened the evidence for the rescue of Wnt inhibition by exogenous Activin treatment and have nuanced our interpretation. We believe that our data suggest low levels of Wnt may be required directly for PGC competence, while much higher levels are required indirectly to induce Nodal, with Nodal signaling being the limiting factor for PGC specification under the reference condition with BMP4 treatment only. We describe this in detail in the manuscript but summarize it here in a simplified diagram:

      We have also carried out additional experiments that match model predictions demonstrating the importance of relative BMP and Nodal signaling levels and amended the text to reflect the evidence as suggested. More details are provided below.

      The molecular characterization of why colonies confined to small areas differentiate much better would greatly increase the biological significance of the manuscript (the technical achievement of reaching such efficiency is impressive on its own).

      We believe that cells confined to small colonies differentiate to PGCLCs more efficiently because a larger fraction of the cells is exposed to the necessary levels of BMP and Nodal signaling. In large colonies, BMP signaling was shown to be restricted to within 50-100 µm of the colony edge through receptor localization and the secretion of inhibitors (Etoc et al., Dev Cell 2016). From this, one would expect BMP signaling to extend a similar distance from the edge in small colonies, so that a larger fraction of cells receives the BMP signal needed to differentiate to PGCLCs. Because it had not previously been shown that the length scale of BMP signaling and downstream signals is preserved as colony size is reduced, we have now included an analysis of BMP signaling (pSmad1 levels) and Nodal signaling (nuclear Smad2/3 levels) as a function of colony size (Figure 5i-k). This confirms our hypothesis and provides a potential mechanism.
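      The geometric part of this argument can be made concrete with a short sketch. It assumes, purely for illustration, a circular colony with uniform cell density and an edge-restricted signal whose range is fixed at 75 µm (the midpoint of the 50-100 µm range above) regardless of colony size; the function name and colony diameters are hypothetical. Under these assumptions the fraction of cells within signaling range of the edge is 1 - ((R - d)/R)^2 for a colony of radius R and signaling range d.

```python
def edge_fraction(radius_um, signaling_range_um):
    """Fraction of a disc-shaped colony's area lying within `signaling_range_um` of its edge.

    Illustrative assumptions: circular colony, uniform cell density, and an
    edge-restricted signal whose length scale does not change with colony size.
    """
    if signaling_range_um >= radius_um:
        return 1.0  # the whole colony lies within signaling range of the edge
    inner = (radius_um - signaling_range_um) / radius_um
    return 1.0 - inner ** 2  # area outside the unsignaled inner disc


# Hypothetical colony diameters (µm) with an assumed 75 µm signaling range.
for diameter in (1000, 500, 250):
    print(diameter, round(edge_fraction(diameter / 2, 75), 2))
```

      With these assumed numbers the signaled fraction rises from roughly 28% of a 1000 µm colony to about 84% of a 250 µm colony, which is the kind of size dependence the pSmad1 and Smad2/3 measurements were designed to test.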

      The authors propose a mathematical model based on BMP and Nodal signaling that qualitatively recapitulates their experimental data. While the authors should be commended for providing examples of other simple models that do not fully recapitulate their data, it would have been nice to see an attempt at challenging the model quantitatively. In particular, the authors do not take advantage of the ability to explore the BMP/Nodal phase space in a more systematic manner with their system.

      We thank the reviewer for this suggestion. Experimentally, we have now tested the effect of 5x5 = 25 different combinations of BMP and Activin doses on PGCLC differentiation. We then challenged the mathematical model to predict the ‘phase diagram’ corresponding to these data, with good agreement (Figure 6f). It is important to note that the model was fit using only data with 50 ng/ml of BMP, making this a true prediction. We also point out that the phase diagram predicted in this way differs from the one shown in Figure 6d, not only because of its lower resolution, but because Figure 6d shows the steady state after uniform stimulation in space and time (i.e. the response at the very edge), whereas the predicted phase diagram shows average expression at 42 h within 100 µm of the colony edge, using the previously measured spatiotemporal gradients of BMP and Activin response. Finally, the data in Figure 6f show mean expression levels, as opposed to the percentage of double-positive cells for the same data in Figure 4q, because our model does not simulate individual cells and noise, which only allows us to compare mean expression. We now explain all this in the text. As a minor change to facilitate comparison of data and model, we now plot the concentrations of BMP and Activin in Figure 6 rather than the scaled model parameters from 0 to 1. We also further optimized the model parameters, without qualitative changes.

      The authors' claim that PGCLC formation can be rescued by exogenous Activin when blocking endogenous Wnt production is surprising given the literature. The authors only show that they can restore a TFAP2C+SOX17+ population but do not actually stain for an established germ cell marker. It appears essential to perform a PRDM1 staining in these conditions (Figure 4A) to unambiguously identify this population.

      We have significantly extended our analysis of the effect of WNT inhibition and the subsequent rescue of PGCs by Activin treatment. This includes staining for TFAP2C, NANOG and PRDM1, and for LEF1 as a measure of WNT signaling. Figure 4 and Figure 4—figure supplement 1 now also include treatment with IWR-1, a different small-molecule inhibitor of WNT signaling, as well as inhibition by IWR-1 and IWP2 at different times and doses.

      The authors only provide weak evidence that the fates depend on the relative signaling levels of BMP and Nodal. Indeed, fewer cells acquire a fate the lower the BMP concentration they use, including the fates marked by Sox17 expression. It would be more convincing to show the assay of Figure 4F for a range of BMP concentrations at which the overall differentiation works sufficiently well.

      As suggested, we have now included a range of BMP concentrations. The reduction in PGCs at lower BMP doses is in line with our model and does not contradict a dependence on the relative signaling levels of BMP and Nodal, by which we mean that the optimal dose of Activin for PGCLC specification depends on the level of BMP and vice versa. We have amended the text to state this more clearly.

      References

      Chen, Di, Na Sun, Lei Hou, Rachel Kim, Jared Faith, Marianna Aslanyan, Yu Tao, et al. 2019. “Human Primordial Germ Cells Are Specified From Lineage-Primed Progenitors.” Cell Reports 29 (13): 4568–4582.e5. doi:10.1016/j.celrep.2019.11.083.

      Etoc, Fred, Jakob Metzger, Albert Ruzo, Christoph Kirst, Anna Yoney, M Zeeshan Ozair, Ali H Brivanlou, and Eric D Siggia. 2016. “A Balance Between Secreted Inhibitors and Edge Sensing Controls Gastruloid Self-Organization.” Developmental Cell 39 (3): 302–15. doi:10.1016/j.devcel.2016.09.016.

      Kobayashi, Toshihiro, Haixin Zhang, Walfred W C Tang, Naoko Irie, Sarah Withey, Doris Klisch, Anastasiya Sybirna, et al. 2017. “Principles of Early Human Development and Germ Cell Program From Conserved Model Systems.” Nature 546 (7658): 416–20. doi:10.1038/nature22812.

      Kojima, Yoji, Kotaro Sasaki, Shihori Yokobayashi, Yoshitake Sakai, Tomonori Nakamura, Yukihiro Yabuta, Fumio Nakaki, et al. 2017. “Evolutionarily Distinctive Transcriptional and Signaling Programs Drive Human Germ Cell Lineage Specification From Pluripotent Stem Cells.” Cell Stem Cell 21 (4): 517–532.e5. doi:10.1016/j.stem.2017.09.005.

      Sasaki, Kotaro, Tomonori Nakamura, Ikuhiro Okamoto, Yukihiro Yabuta, Chizuru Iwatani, Hideaki Tsuchiya, Yasunari Seita, et al. 2016. “The Germ Cell Fate of Cynomolgus Monkeys Is Specified in the Nascent Amnion.” Developmental Cell 39 (2): 169–85. doi:10.1016/j.devcel.2016.09.007.

      Tyser, Richard C V, Elmir Mahammadov, Shota Nakanoh, et al. 2021. “Single-Cell Transcriptomic Characterization of a Gastrulating Human Embryo.” Nature 600: 285–89. doi:10.1038/s41586-021-04158-y.

    1. Author Response

      Reviewer #1:

      This is a very timely paper that addresses an important and difficult-to-address question in the decision-making field - the degree to which information leakage can be strategically adapted to optimise decisions in a task-dependent fashion. The authors apply a sophisticated suite of analyses that are appropriate and yield a range of very interesting observations. The paper centres on analyses of one possible model that hinges on certain assumptions about the nature of the decision process for this task which raises questions about whether leak adjustments are the only possible explanation for the current data. I think the conclusions would be greatly strengthened if they were supported by the application and/or simulation of alternative model structures.

      We thank the reviewer for this positive appraisal of our study. We now entirely agree with their central comment about whether leak adjustments are the only (or even the best) explanation for the current data. We hope that the additional modelling sections that we have discussed in response to main comment 1 above have strengthened the paper. We have responded point-by-point to their public review, as this contained their main recommendations for revision.

      The behavioural trends when comparing blocks with frequent versus rare response periods seem difficult to tally with a change in the leak. […] Are there other models that could reproduce such effects? For example, could a model in which the drift rate varies between Rare and Frequent trials do a similar or better job of explaining the data?

      We can see why the reviewer has advocated for a possible change of drift rate (or ‘gain’ applied to sensory evidence) between conditions to explain our behavioural findings. We found, however, that changes in drift rate produce changes in integration kernels that are qualitatively similar to those produced by changes in decision threshold:

      Author response image 1.

      Changes in gain applied to incoming sensory evidence (A parameter in model) have similar effects on recovered integration kernels from Ornstein-Uhlenbeck simulation as changes in decision threshold.

      The likely reason for this is that the overall probability of emitting a response at any point in the continuous decision process is determined by the ratio of accumulated evidence to decision threshold. A similar logic applies to effects on reaction times and detection probability (main figure 2): increasing sensory gain or decreasing decision threshold will lead to faster reaction times and increased detection probability during response periods.
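      The ratio argument can be illustrated with a minimal Euler simulation of a leaky (O-U-style) integrator. This is a sketch under stated assumptions, not the model fitted in the paper: the stimulus stream, leak, time step and thresholds are hypothetical, `gain` plays the role of the A parameter mentioned above, and no internal noise is added, so the accumulator is driven only by the stimulus. Under those assumptions the accumulator scales linearly with the gain, so doubling the gain yields exactly the same first-crossing time as halving a symmetric threshold.

```python
import random


def leaky_integrator(stimulus, gain, leak, dt=0.01):
    """Euler integration of dx/dt = -leak * x + gain * s(t), starting from x = 0.

    No internal noise is added: all variability comes from the stimulus stream,
    which is what makes gain and threshold exactly interchangeable here.
    """
    x = 0.0
    trace = []
    for s in stimulus:
        x += dt * (-leak * x + gain * s)
        trace.append(x)
    return trace


def first_crossing(trace, threshold):
    """Index of the first sample where |x| reaches a symmetric threshold, or None."""
    for i, x in enumerate(trace):
        if abs(x) >= threshold:
            return i
    return None


rng = random.Random(0)
# Hypothetical noisy evidence stream with a small positive mean.
stimulus = [rng.gauss(0.3, 1.0) for _ in range(5000)]

# Doubling the gain gives the same response time as halving the threshold.
t_high_gain = first_crossing(leaky_integrator(stimulus, gain=2.0, leak=0.5), threshold=1.0)
t_low_threshold = first_crossing(leaky_integrator(stimulus, gain=1.0, leak=0.5), threshold=0.5)
print(t_high_gain, t_low_threshold)
```

      If independent internal noise that is not scaled by the gain were added to the accumulator, the equivalence would only hold approximately, which is consistent with the two manipulations producing qualitatively similar, rather than identical, integration kernels.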

      Both parameters may even have a similar effect on ‘false alarms’, because (as the reviewer notes below) false alarms in our paradigm are driven primarily by the occurrence of stimulus changes as well as by internal noise. In fact, the false alarm findings make it difficult to fully reconcile all of our behavioural findings in terms of changes in a single set of model parameters in the O-U process. It is possible that other changes not considered within our model (such as expectations about the hazard rate of inter-response intervals leading to dynamic thresholds) may have had a strong impact on the resulting false alarm rates. A full exploration of different variants of the O-U model (with varying urgency signals, hazard rates, etc.) is beyond the scope of this paper.

      For this reason, we have decided in our new modelling section to focus primarily on a single, well-established model (the O-U process) and explore how changes in leak and threshold affect task performance and the resulting integration kernels. We note that this is in line with the suggestion of reviewer #2, who focussed on similar behavioural findings to reviewer #1 but suggested that we look at decision threshold rather than drift rate as our primary focus.

      This ties in to a related query about the nature of the task employed by the authors. Due to the very significant volatility of the stimulus, it seems likely that the participants are not solely making judgments about the presence/absence of coherent motion but also making judgments about its duration (because strong coherent motion frequently occurs in the inter-target intervals). If that is so, then could the Rare condition equate to less evidence because there is an increased probability that an extended period of coherent motion could be an outlier generated from the noise distribution? Note that a drift rate reduction would also be expected to result in fewer hits and slower reaction times, as observed.

      As mentioned above, the rare and frequent targets are indeed matched in terms of the ease with which they can be distinguished from the intervening noise intervals. To confirm this, we directly calculated the variance (across frames) of the motion coherence presented during baseline periods and response periods (until response) in all four conditions:

      Author response image 2.

      The average empirical standard deviation of the stimulus stream presented during each baseline period (‘baseline’) and response period (‘trial’), separated by each of the four conditions (F = frequent response periods, R = rare, L = long response periods, S = short). Data were averaged across all response/baseline periods within the stimuli presented to each participant (each dot = 1 participant). Note that the standard deviation shown here is the standard deviation of motion coherence across frames of sensory evidence. This is smaller than the standard deviation of the generative distribution of ‘step’-changes in the motion coherence (std = 0.5 for baseline and 0.3 for response periods), because motion coherence remains constant for a period after each ‘step’ occurs.

      Some adjustment of the language used when discussing FAs seems merited. If I have understood correctly, the sensory samples encountered by the participants during the inter-response intervals can at times favour a particular alternative just as strongly (or more strongly) than that encountered during the response interval itself. In that sense, the responses are not necessarily real false alarms because the physical evidence itself does not distinguish the target from the non-target. I don't think this invalidates the authors' approach but I think it should be acknowledged and considered in light of the comment above regarding the nature of the decision process employed on this task.

      This is a good point. We hope that the reviewer will allow us to keep the term ‘false alarms’ in the paper, as it does conveniently distinguish responses during baseline periods from those during response periods, but we have sought to clarify the point that the reviewer makes when we first introduce the term.

      “Indeed, participants would occasionally make ‘false alarms’ during baseline periods in which the structure of the preceding noise stream mistakenly convinced them they were in a response period (see Figure 4, below). This means that a ‘false alarm’ in our paradigm has a slightly different meaning than in most psychophysics experiments; rather than referring to participants responding when a stimulus was not present, we use the term to refer to participants responding when there was no shift in the mean signal from baseline.”

      And:

      “The fact that evidence integration kernels naturally arise from false alarms, in the same manner as from correct responses, demonstrates that false alarms were not due to motor noise or other spurious causes. Instead, false alarms were driven by participants treating noise fluctuations during baseline periods as sensory evidence to be integrated across time, and the physical evidence preceding ‘false alarms’ need not even distinguish targets from non-targets.”

      The authors report that preparatory motor activity over central electrodes reached a larger decision threshold for RARE vs. FREQUENT response periods. It is not clear what identifies this signal as reflecting motor preparation. Did the authors consider using other effector-selective EEG signatures of motor preparation such as beta-band activity which has been used elsewhere to make inferences about decision bounds? Assuming that this central ERP signal does reflect the decision bounds, the observation that it has a larger amplitude at the response on Rare trials appears to directly contradict the kernel analyses which suggest no difference in the cumulative evidence required to trigger commitment.

      Thanks for this comment. First, we should note that this finding emerged from an agnostic time-domain analysis of the data time-locked to button presses, in which we simply observed that the negative-going potential was greater (more negative) in RARE vs. FREQUENT trials. We relate it to motor preparation simply because it precedes each button press; nonetheless, we note that Kelly and O’Connell (2013) found similar negative-going potentials at central sensors without applying a CSD transform (as in this study). Like them, we would relate this potential to either the well-established Bereitschaftspotential or the contingent negative variation (CNV).

      We agree that many other studies have focussed on beta-band activity as another measure of motor preparation, and to make inferences about decision bounds. To investigate this, we used a Morlet wavelet transform to examine the time-varying power estimate at a central frequency of 20Hz (wavelet factor 7). We repeated the convolutional GLM analysis on this time-varying power estimate.
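      For readers less familiar with the method, the sketch below shows the general technique of estimating time-varying power at 20 Hz by convolving a signal with a complex Morlet wavelet. It is not the authors' pipeline, whose implementation details are not given here. We assume that "wavelet factor 7" means 7 cycles, so the Gaussian envelope has a temporal SD of 7/(2π·20) s. The sampling rate and test signals are hypothetical, and the function name is ours.

```python
import cmath
import math


def morlet_power(signal, fs, freq=20.0, n_cycles=7):
    """Time-varying power of `signal` at `freq` (Hz) via a complex Morlet wavelet.

    Assumes n_cycles sets the wavelet width: sigma_t = n_cycles / (2 * pi * freq).
    Samples beyond the signal edges are treated as zero, so edge values are attenuated.
    """
    sigma_t = n_cycles / (2 * math.pi * freq)
    half = int(round(3 * sigma_t * fs))  # truncate the Gaussian envelope at +/- 3 SD
    offsets = range(-half, half + 1)
    envelope = [math.exp(-((k / fs) ** 2) / (2 * sigma_t ** 2)) for k in offsets]
    wavelet = [e * cmath.exp(2j * math.pi * freq * k / fs) for e, k in zip(envelope, offsets)]
    norm = sum(envelope)  # with this scaling a unit-amplitude sinusoid at `freq` gives power ~0.25

    n = len(signal)
    power = []
    for i in range(n):
        acc = 0j
        for k, w in zip(offsets, wavelet):
            j = i - k
            if 0 <= j < n:
                acc += signal[j] * w
        power.append(abs(acc / norm) ** 2)
    return power


fs = 250  # hypothetical sampling rate (Hz)
times = [i / fs for i in range(2 * fs)]
beta_wave = [math.sin(2 * math.pi * 20 * t) for t in times]
slow_wave = [math.sin(2 * math.pi * 8 * t) for t in times]
p_beta = morlet_power(beta_wave, fs)[fs]  # power at t = 1 s, away from the edges
p_slow = morlet_power(slow_wave, fs)[fs]
print(round(p_beta, 3), p_slow)
```

      A 20 Hz input yields substantial power while an 8 Hz input yields almost none, illustrating the frequency selectivity that a 7-cycle wavelet provides. Beta desynchronisation would then appear as a drop in this power estimate before the response.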

      We first examined average beta desynchronisation at a central cluster of electrodes (CPz, CP1, CP2, C1, Cz, C2) in the run-up to correct button presses during response periods. We found that a reliable beta desynchronisation occurred and that, just as in the time-domain signal, it reached a greater threshold in the RARE trials than in the FREQUENT trials:

      Author response image 3.

      Beta desynchronisation prior to a correct response is greater over central electrodes in the RARE condition than in the FREQUENT condition.

      We agree with the reviewer that this is likely indicative of a change in decision threshold between rare and frequent trials. We also note that our new computational modelling of the O-U process suggests that this in fact reconciles well with the behavioural findings (changes in integration kernels). We now mention this at the relevant point in the results section:

      “As large changes in mean evidence are less frequent in the RARE condition, the increased neural response to |Δevidence| may reflect the increased statistical surprise associated with the same magnitude of change in evidence in this condition. In addition, when making a correct response, preparatory motor activity over central electrodes reached a larger decision threshold for RARE vs. FREQUENT response periods (Figure 7b; p=0.041, cluster-based permutation test). We found similar effects in beta-band desynchronisation prior to the response, averaged over the same electrodes; beta desynchronisation was greater in RARE than FREQUENT response periods. As discussed in the computational modelling section above, this is consistent with the changes in integration kernels between these conditions, as it may reflect a change in decision threshold (figure 2d, 3c/d). It is also consistent with the lower detection rates and slower reaction times when response periods are RARE (figure 2 b/c).”

      We also investigated the lateralised response (left minus right beta desynchronisation, contrasted for left minus right responses). However, we were unable to detect a reliable lateralised signal in either condition. We suspect that this is because we have far fewer response periods than conventional trial-based EEG experiments of decision making, and so we did not have sufficient SNR to detect this signal reliably. This is consistent with standard findings in the literature, which report that the magnitude of the lateralised signal is far smaller than that of the overall beta desynchronisation (e.g. Doyle et al., 2005).

      P11: the "absolute sensory evidence" regressor elicited a triphasic potential over centroparietal electrodes. The first two phases of this component look to have an occipital focus. The third phase has a more centroparietal focus but appears markedly more posterior than the change in evidence component. This raises the question of whether it is safe to assume that they reflect the same process.

      We agree. We now refer to this as a ‘triphasic component over occipito-parietal cortex’ rather than as a potential over centroparietal electrodes.

      Reviewer #2:

      Overall, the authors use a clever experimental design and approach to tackle an important set of questions in the field of decision-making. The manuscript is easy to follow with clear writing. The analyses are well thought-out and generally appropriate for the questions at hand. From these analyses, the authors have a number of intriguing results. So, there is considerable potential and merit in this work. That said, I have a number of important questions and concerns that largely revolve around putting all the pieces together. I describe these below.

      Thanks to the reviewer for their positive appraisal of the manuscript; we are obviously pleased that they found our work to have considerable potential and merit. We seek to address the main comments from their public review and recommendations below.

      1) It is unclear to what extent the decision threshold is changing between subjects and conditions, how that might affect the empirical integration kernel, and how well these two factors can together explain the overall changes in behavior.

      I would expect that less decay in RARE would have led to more false alarms, higher detection rates, and faster RTs unless the decision threshold also increased (or there was some other additional change to the decision process). The CPP for motor preparatory activity reported in Fig. 5 is also potentially consistent with a change in the decision threshold between RARE and FREQUENT. If the decision threshold is changing, how would that affect the empirical integration kernel? These are important questions on their own and also for interpreting the EEG changes.

      This important comment, alongside the comments of reviewer 1 above, made us carefully consider the effects of changes in decision threshold on the evidence integration kernel via simulation. As discussed above (in response to ‘essential revisions for the authors’), we now include an entirely new section on how changes in decision threshold and leak may affect the evidence integration kernel, and how they may be used to optimise performance across the different sensory environments. In particular, we agree with the reviewer that the motor preparatory activity that differs between RARE and FREQUENT is consistent with a change in decision threshold, and our simulations suggest that our behavioural findings on evidence integration are consistent with this change too. These are detailed on pp. 1-4 of the rebuttal, above.

      2) The authors find an interesting difference in the CPP for the FREQUENT vs RARE conditions where they also show differences in the decay time constant from the empirical integration kernel. As mentioned above, I'm wondering what else may be different between these conditions. Do the authors have any leverage in addressing whether the decision threshold differs? What about other factors that could be important for explaining the CPP difference between conditions? Big picture, the change in CPP becomes increasingly interesting the more tightly it can be tied to a particular change in the decision process.

      We fully agree with the spirit of this comment, and we have tried to consider much more carefully what the influences of decision threshold and leak would be on our behavioural analyses. As discussed in the response to reviewer 1, we think that the negative-going potential at the time of responses (which is greater in RARE vs. FREQUENT, main figure 7b) and the equivalent changes in beta desynchronisation (Author response image 3 above) both reflect a change in decision threshold between RARE and FREQUENT conditions. We have tried to make this link explicit in the revised results section:

      “As large changes in mean evidence are less frequent in the RARE condition, the increased neural response to |Δevidence| may reflect the increased statistical surprise associated with the same magnitude of change in evidence in this condition. In addition, when making a correct response, preparatory motor activity over central electrodes reached a larger decision threshold for RARE vs. FREQUENT response periods (Figure 7b; p=0.041, cluster-based permutation test). We found similar effects in beta-band desynchronisation prior to the response, averaged over the same electrodes; beta desynchronisation was greater in RARE than FREQUENT response periods. As discussed in the computational modelling section above, this is consistent with the changes in integration kernels between these conditions, as it may reflect a change in decision threshold (figure 2d, 3c/d). It is also consistent with the lower detection rates and slower reaction times when response periods are RARE (figure 2 b/c).”

      I'll note that I'm also somewhat skeptical of the statements by the authors that large shifts in evidence are less frequent in the RARE compared to FREQUENT conditions (despite the names) - a central part of their interpretation of the associated CPP change. The FREQUENT condition obviously has more frequent deviations from the baseline, but this is countered to some extent by the experimental design that has reduced the standard deviation of the coherence for these response periods. I think a calculation of overall across-time standard deviation of motion coherence between the RARE and FREQUENT conditions is needed to support these statements, and I couldn't find that calculation reported. The authors could easily do this, so I encourage them to check and report it.

      See Author response image 2.

      3) The wide range of decay time constants between subjects and the correlation of this with another component of the CPP is also interesting. However, in trying to interpret this change in CPP, I'm wondering what else might be changing in the inter-subject behavior. For instance, it looks like there could be up to 4-fold changes in false alarm rates. Are there other changes as well? Do these correlate with the CPP? Similar to my point above, the changes in CPP across subjects become increasingly interesting the more tightly it can be tied to a particular difference in subject behavior. So, I would encourage the authors to examine this in more depth.

      Thanks for the interesting suggestion. We explored whether there might be any interindividual correlation in this measure with the false alarm rate across participants, but found that there was no such correlation. (See Author response image 4; plotting conventions are as in main figure 9).

      Author response image 4.

      No evidence of between-subject correlations in CPP responses and false alarm rates, in any of the four conditions.

      We hope instead that the extended discussion of how the integration kernel should be interpreted (in light of computational modelling) provides at least some increased interpretability of the between-subject effects that we report in figure 9.

      Reviewer #3 (Public Review):

      The main strength is in the task design which is novel and provides an interesting approach to studying continuous evidence accumulation. Because of the continuous nature of the task, the authors design new ways to look at behavioral and neural traces of evidence. The reverse-correlation method looking at the average of past coherence signals enables us to characterize the changes in signal leading to a decision bound and its neural correlate. By varying the frequency and length of the so-called response period, that the participants have to identify, the method potentially offers rich opportunities to the wider community to look at various aspects of decision-making under sensory uncertainty.

      We are pleased that the reviewer agrees with our general approach as a novel way of characterising various aspects of decision-making under uncertainty.

      The main weaknesses that I see lie within the description and rigor of the method. The authors refer multiple times to the time constant of the exponential fit to the signal before the decision but do not provide a rigorous method for its calculation and neither a description of the goodness of the fit. The variable names seem to change throughout the text which makes the argumentation confusing to the reader. The figure captions are incomplete and lack clarity.

      We apologise that some of our original submission was difficult to follow in places, and we are very grateful to the reviewer for their thorough suggestions for how this could be improved. We address these in turn below, and we hope that this answers their questions, and has also led to a significant improvement in the description and rigour of the methodology.

    1. Author Response

      Reviewer #2 (Public Review):

      I am not a specialist in cryo-EM, so cannot comment on the technicalities of the structure reconstruction or methods used. I thus focus on the conclusions and observations that the authors provide in the manuscript and their relevance to functional photosynthesis.

      The authors attempt to resolve the structure of PSII from Dunaliella and noticed that three types of PSII could be identified: two conformational states, and a stacked configuration. There is no doubt that these structures add to our current knowledge of PSII and that they exist in abundance upon solubilisation of the sample. My main issue however is the relevance to in vivo conditions, and the efforts to exclude the possibility that pigment loss and conformational states and stacking are a reflection of ex-vivo manipulations.

      Our compact model contains 202 Chl molecules, while the stretched conformation contains 206 Chls. All of the differences in Chl binding are attributed to CP29. We have compiled a table enumerating the different CP29 structures currently available from plants and green algae at a resolution similar to that of our work (Supplementary table 2). In the larger plant complexes (C2S2M2), CP29 contains 14 Chls, while CP29 in the smaller C2S2 complexes contains 10-13 Chls, so it appears that some Chl loss from CP29 is associated with the release of LHCII-M. In the green algal structures, CP29 generally contains fewer Chls and shows a similar trend. The currently published structure most relevant to our work (6KAC) contains 8 Chls in CP29, somewhat fewer than both our compact and stretched models (9 and 11 Chls, respectively). The stretched orientation, which is the closest match to the known PSII core arrangement, therefore contains more Chls than comparable models. Although the in vivo configuration is not known, and could contain more Chls, the current structure appears to be the closest available representation of it.

      The presence of CP29 with a lower Chl content in the Chlamydomonas C2S2 structure (6KAC, which is in a stretched orientation) supports the conclusion that pigment loss from CP29 alone is not sufficient to trigger the stretched-to-compact transition, although it is associated with it. In general, the precise orientation of CP29 is variable and seems to depend on the binding of additional LHCII; it is possible that some Chl loss accompanies these changes in vivo.

      I see a number of questions pertaining to this work. Starting from the two conformations of PSII, compact and stretched, the authors say that both are highly active based on oxygen measurements at a saturating light intensity. In the meantime, they report large variations in the chl content and positions of the chlorophyll molecules in these structures (also compared to other known PSIIs). This gives the impression that one can lose two chlorophylls, and freely modify the distance between others without losing efficiency, certainly a risky conclusion. Are the samples highly active also in light-limiting conditions? It is thought that even tiny movements and alterations in chl-chl distances alter their coupling and spectral properties, how come the variations in this report are so huge? In other words, the assay tests the charge separation activity of the PSII RC in the preps, but not the light-harvesting efficiency.

      The difference in Chl content reported in this work amounts to 2%. In our opinion this is quite a low variation in pigment content, of the kind that exists in virtually any experiment involving large complexes. We agree that activity measurements under light-limiting conditions would be interesting; however, they go beyond the scope of the current work. Light-harvesting efficiency in PSII is known to vary substantially as a result of additional mechanisms (NPQ in some of its forms) that are not associated with Chl loss or gain. While the formation of quenching centers is attributed to small structural changes within specific pigment-protein complexes, what we show in this work are structural changes between pigment-protein complexes. These can affect transfer rates between the different complexes but are distinct from the structural changes thought to accompany the formation of quenching centers within specific pigment-protein complexes.
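      The 2% figure follows directly from the chlorophyll counts of the two models (202 Chls in the compact and 206 Chls in the stretched conformation), taking the compact model as the reference:

$$
\frac{206 - 202}{202} = \frac{4}{202} \approx 0.0198 \approx 2\%
$$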

      How does one ascertain that the lost chlorophyll molecules in CP29 are not a preparation error? Does slightly increasing the detergent concentration impact the proportion of stretched:compact forms?

      The effect of detergent concentration on the proportion of the different forms was not tested directly. However, we do not detect substantial differences in lipid or bound-detergent content between the two conformations, suggesting that these “ligands” do not differ much between them. We can only distinguish the two forms at the very last stages of data processing, and given the present cost and availability of cryoEM time, mapping the effect of detergent concentration on the different orientations is beyond our reach.

      On a similar note, how do the authors exclude that a certain interaction with this type of grid impacts the distribution of these complexes? Is it identical to a biologically separate preparation of algae? In case of discoveries of this type, it is of high importance to exclude as many possibilities of non-native conditions or influences on the structure.

      It is hard to completely exclude grid and sample preparation issues. However, we used relatively standard grids and vitrification conditions. The observed complexes are embedded in vitrified ice and do not interact with the grid directly. The differences we observed are mainly in the relative orientations of the PSII cores; all the interactions between PSII subunits within each core are preserved and agree with previously published structures. Since the interactions within a core and between cores involve the same physical principles, we think it is fairly conservative to conclude that the observed core orientations are not an artefact of sample preparation.

      I would further like to encourage the authors to elaborate on the CP29 phosphorylation. What is the proportion of PSIIcomp that are phosphorylated? I assume it is not 100%, as in this case, the authors would propose that this is the effect that modulates between compact and stretched architectures.

      It is difficult to estimate the proportion of observed phosphorylation/sulfinylation. To be detected in the maps, most of the residues (above 50%) are probably modified. We attempted to estimate this by refining the atom occupancies of the phosphate group on Ser84 and of the oxygens attached to Cys218; both values suggested that about 70% of the complexes are modified. Regarding the possibility that these modifications promote the formation of the compact state, we think this is certainly a possibility, since these modifications were detected in this state and are in close proximity to each other. However, this observation could also result from the resolution differences between the maps, and the structural implications of both modifications are hard to predict. At this point we prefer to note their existence without further interpretation.

      In line 290, the authors highlight the structural heterogeneity within the two groups of PSII conformations. I would like to see what the distribution looks like for all the structures together: do the two (stretched and compact) specifically form two heterogeneous distributions? Or is it possible that the distribution between the two is quasi-continuous? In other words, if the structures are not perfectly defined, how do the authors decide that two subtypes exist, and not more or fewer?

      We went back and refined the initial particle group (containing both compact and stretched orientations) using multibody refinement with masks defining the two PSII monomers. This analysis showed the expected two peaks only in the first principal component, which accounted for ~38% of the variance in the dataset.

      Multibody refinement carried out on the combined particle dataset shows one very large PC accounting for about 38% of the variance and the presence of two distinct peaks in the particle distribution of the first PC.

      This analysis makes it clear that there are two distinct classes in this particle set (as expected). As none of the other PCs shows any sign of multiple peaks, it suggests that two distinct models are the best representation of this eukaryotic PSII. Whether these states are quasi-continuous or distinct is a more complex question. There is continuity in this representation (the particle distributions along the PC); a different picture may emerge if characteristics such as the CP29 state are considered, but the small size of CP29 and the remaining heterogeneity do not currently provide enough signal to carry out such a classification.
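      To make the "two peaks" criterion concrete, here is a hypothetical sketch, not the multibody analysis pipeline used in the manuscript: it bins per-particle amplitudes along one principal component into a histogram and counts local maxima. The function name, bin count and minimum-count threshold are illustrative assumptions.

```python
from typing import Sequence


def count_histogram_peaks(values: Sequence[float], n_bins: int = 20,
                          min_count: int = 1) -> int:
    """Count local maxima in a histogram of per-particle PC amplitudes."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 1  # every particle has the same amplitude: a single peak
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp so that the maximum value falls into the last bin.
        counts[min(int((v - lo) / width), n_bins - 1)] += 1

    # Collapse runs of equal counts so a flat-topped peak is counted once,
    # and pad with zeros so edge bins can be peaks.
    profile = [0]
    for c in counts:
        if c != profile[-1]:
            profile.append(c)
    if profile[-1] != 0:
        profile.append(0)

    peaks = 0
    for i in range(1, len(profile) - 1):
        if profile[i] >= min_count and profile[i - 1] < profile[i] > profile[i + 1]:
            peaks += 1
    return peaks
```

      On real particle data, smoothing the histogram (or using a kernel density estimate) before counting peaks would make the result less sensitive to the choice of bin width.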

      Considering the stacked PSII, I also have a few concerns. Contrary to previous studies the authors do not assign a functional role to the stacking beyond the structural aspect. This could be better backed by a discussion about the closest chlorophyll a molecules across the stacked PSII, which given the rather large distance shown in fig. 4L seems to be too large for any EET across the stromal gap.

      The closest Chl-Chl distance that we can measure in the stacked PSII dimer is ~54 Å, with most distances in the ~70 Å range, making EET between stacked complexes very slow. We have added a statement clarifying this to our manuscript. In our opinion, a structural role for the stacked PSII dimer is more likely.
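      The expectation of slow transfer across the stromal gap follows from the steep distance dependence of Förster resonance energy transfer. As a rough check that uses only the two distances quoted above and leaves the Förster radius $R_0$ and the donor lifetime $\tau_D$ unspecified (their values are not given here), the standard rate expression and the ratio between the two distances are:

$$
k_{\mathrm{ET}}(r) = \frac{1}{\tau_D}\left(\frac{R_0}{r}\right)^{6},
\qquad
\frac{k_{\mathrm{ET}}(70\,\text{\AA})}{k_{\mathrm{ET}}(54\,\text{\AA})} = \left(\frac{54}{70}\right)^{6} \approx 0.21
$$

      Under this model, most cross-stack chlorophyll pairs at ~70 Å would transfer roughly five times more slowly than the closest pair, independent of the value assumed for $R_0$.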

      There is a report that suggests the presence of some density between the stacked PSII - could the authors comment on the differences between it and their work? Are the angles and positions conserved between these types of stacks? https://doi.org/10.1038/s41598-017-10700-8

      We cite Albanese et al. in our manuscript. We isolated the C2S2 complex from a green alga, whereas the analysis in Albanese et al. was done on C2S2M1 complexes from pea, which may account for some of the differences. In any case, we clearly state our conclusion that we find no evidence for protein linkers in the stacked complex. The angles described in Albanese et al. are consistent with our analysis.

      Line 387: the authors state that, due to the transient nature of the interactions across the stromal gap, the stacks could be "under-detected" in cryo-ET data. In my opinion this statement is misformulated. First, the transient-interaction argument would apply equally (if not more, given the changing conditions induced by the purification process) to the single-particle analysis performed in this paper. Second, tomographic volumes capture hundreds of PSII in a suspended state. Any transient interaction that accounts for 25% of the particle population in a steady-state cell should be clearly visible, whereas the in situ data suggest no more than random cross-stromal-gap orientations. Of course, this could be specific to Chlamydomonas or to a particular growth condition. The authors' statement could indeed be reformulated as: the PSII stacks are over-detected in vitro, which is certainly a simpler explanation for their presence. It is also important to mention that PSII stacking is not the only determinant of grana architecture: stacking with the antenna of larger complexes, absent from the authors' preparation, could also contribute to grana maintenance, and auxiliary proteins such as CURT also help in this respect. A recent demonstration of the importance of the minor antenna should probably also be cited here: https://doi.org/10.1101/2021.12.31.474624

      We used the term “flexible” rather than “transient” to describe the interactions within the stacked PSII dimer. Our data (and tomographic data) do not contain any temporal component. When we used the term under-detected, we referred to the fact that PSII is mainly detected by its luminal extrinsic subunits. The flexibility detected in our analysis may affect the simultaneous visibility of these features in the PSII complexes making up an individual PSII stack. Specifically, Wietrzynski et al. mainly analyzed C2S2M2L2 complexes, whereas our analysis contained only C2S2 complexes. It is likely that the amount of bound LHCII affects PSII stacking as well. For example, Wietrzynski et al. show some overlap between LHCII complexes and little overlap between cores in the larger complexes they analyzed. We observe mainly core-to-core overlap with little LHCII overlap in the smaller C2S2, although we did not observe any states in which LHCs were not included in what appears to be the binding interface. We agree with the reviewer on the relevance of Lhcb and CURT contributions to stacking but prefer to focus on what was directly demonstrated by our data. We clearly note that we are discussing in vitro results.

      Building on these last thoughts, I would like to finish by mentioning one more thing, an almost philosophical one. The authors are certainly at the forefront of the booming cryoEM revolution in biology, which is profoundly changing the way we understand living systems. There is absolutely no doubt that this powerful technique is of the highest interest. But a growing number of structures of photosynthetic complexes remain puzzling, in particular with regard to their abundance in vivo (such as the PSII stacks) and their functional relevance. How do we ascertain that these interactions are not due to in vitro preparation (isolation from cells, solubilisation)? Which approaches can we use to try to exclude this (simple) hypothesis? I suggest that at least a small number of biological replicates (experiments performed on separate batches, under different technical conditions, with slightly altered solubilisation conditions, and so on) could shed light on the nature of these structures and their occurrence in vivo. Technical replicates of the freezing and analysis pipeline could also be tried to assess the variability. This would strongly reinforce this manuscript and its conclusions, and while not completely unequivocal (the stacked PSII, for example, could form during each purification), a quantification of the effects would be of high interest.

      We certainly share the reviewer's hope of being able to conduct cause-and-effect cryoEM experiments covering a complete set of experimental parameters. This is still beyond reach in terms of time and cost. Within each cryoEM experiment, however, all the analysis is consistent and, more importantly, transparent with regard to image analysis, which in our opinion is the most important factor. Preparation artefacts are always a possibility but, in our opinion, cryoEM is not differentially affected by them compared with other techniques. As mentioned above, the particles are observed suspended in vitreous ice; this is no different from, and arguably better than, numerous low-temperature spectroscopic observations on samples suspended in a glassy state, or crystals obtained in the presence of high concentrations of various agents. One thing that validates structural studies is the chemical detail (bond lengths, angles, etc.) underlying every model, which is consistent with known values to close tolerances.

      Reviewer #3 (Public Review):

      In this manuscript, Caspy et al. present a detailed structural analysis of eukaryotic photosystem II (PSII) isolated from the green alga Dunaliella salina. By combining single-particle cryo-EM with multibody refinement, the authors not only reveal a high-resolution (2.4Å) structure of the eukaryotic PSII, but also demonstrate alternate conformations and intrinsic flexibility of the overall complex. Stretched and compact conformations of the PSII dimer were readily identified within the single-particle dataset. From this structural analysis, the authors propose that excitation energy transfer properties may be modulated by changes in transfer distance between key chlorophyll molecules observed in different conformational states of the PSII dimer. Due to the high resolution of the maps obtained, the authors identify post-translational modifications and a sodium binding site based on the observed cryo-EM maps. Additionally, the authors analyze PSII complexes in stacked and unstacked configurations, and find that compact and stretched states also exist within the stacked PSII complexes. From their cryo-EM maps, the authors demonstrate that there is no direct protein-protein interaction between stacked PSII complexes, and rather propose a model wherein long-range electrostatic interactions mediated by divalent cations such as magnesium, can facilitate PSII stacking.

      The conclusions and models presented in the manuscript are mostly well justified by the data. The cryo-EM maps are high quality and the models appear generally well refined. However, some aspects of data processing and analysis, as well as the resultant conclusions need to be clarified.

      1) In general, it is not clear from the cryo-EM processing workflow (suppl. Fig 1) or the methods section when exactly symmetry was applied during 3D classification and refinement. In the case of C2S2 unstacked particles, when was symmetry first applied in the overall processing workflow? To identify the compact and stretched configurations of C2S2, did the 3D classification without alignment (and/or the refinement preceding this classification) have C2 symmetry applied? If so, have you considered the possibility that some particles may actually be asymmetric in some regions?

      We modified figure S1 to clearly indicate the use of symmetry and particle expansion. In general, we refined most of the particle sets without symmetry (C1). At the final processing stage of the unstacked PSII sets, after we separated both conformations, we used C2 symmetry to expand the data, this was followed by multibody refinement. No symmetry or symmetry expansion was used for the stacked PSII particle sets.

      2) Following multibody refinement in Relion individual maps and half-maps for each body will be generated. There is no mention in the methods of how these individual maps for each C2S2 "monomer" were combined to produce an overall map of the dimer following multibody refinement. There are several methods currently used to combine such maps, including taking the maximum or average of the two maps or using a model-based approach in phenix. The authors should be explicit about the method they used, any potential artifacts that may develop from this map combination process, and/or the interface between masks used in multibody refinement.

      We used phenix.combine_focused_maps to combine the maps. This is now indicated in the Methods section.

      3) In addition to the point raised above, following multibody refinement there will be an individual FSC curve and resolution for each body. However, in supplemental figure 2 and supplemental table 1, only a single FSC curve and resolution are reported. Are these FSC curves/resolutions only reported for the better of the two bodies? If not, how was a single resolution calculated for the overall map of combined bodies?

      Both FSC curves were calculated and were highly similar, as expected following C2 expansion. This can also be evaluated from the local resolution maps which are highly similar between the two bodies. The reported resolutions are all taken from the displayed FSC curves generated through relion PostProcess.

      4) One of the major conclusions from the 3D classification and multibody refinement is that conformational changes and inherent flexibility of the PSII dimers have the potential to change distances between cofactors in the complex, ultimately leading to altered excitation energy transfer. However, it is unclear whether or not the authors believe one conformation over another may more readily support the evolution of oxygen. It would be nice if the authors could elaborate slightly upon this topic in the discussion.

      As discussed above, the structural changes associated with the formation of quenching centers are not expected to be detectable in the current work. The changes we observe can, however, affect transfer to such centers and thereby play an important part in PSII biology. We do not detect any changes around the OEC, and we find no reason to think that the two conformations differ with respect to their ETC.

      5) Along the lines of point 4 above, on line 95 the authors claim that "the high specific activity of 816 umol O2/ (mg Chl * hr) suggest that" both the C2S2 compact and stretched conformation are highly active. However, it is not clear to me why this measure of specific activity would suggest that both PSII conformations should have "high" activity. Maybe a reference here would help guide readers to previous measures of specific activity?

      Looking at the specific activities reported in previously published structural studies of eukaryotic PSII, Sheng et al., 2019 reported a specific activity of 272 µmol O2/(mg Chl * hr). This difference can stem partially from the presence of larger complexes in their preparation, and their value is comparable to the activity we measured in our As fraction (276 µmol O2/(mg Chl * hr), Figure 1-figure supplement 9). Reported specific activity values from plants (Pisum sativum) are also similar: Su et al. reported a maximal value of 288 µmol O2/(mg Chl * hr), again for larger complexes, which can explain some of the difference. However, the specific activity measured for the C2S2 PSII isolated in the current study is 2.8-fold higher than this value, more than the differences in Chl content, which range from 1.5- to 2-fold in favor of the larger complexes. If either one of the conformations were less active, the other conformation would have to display an even higher specific activity, which seems less likely. In addition, we find no difference in either shape or map strength around the oxygen-evolving center or in the peripheral luminal subunits, so both orientations show highly similar structures in the regions that determine oxygen-evolution activity.
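      The fold differences quoted above can be checked directly from the reported specific activities (all in µmol O2/(mg Chl * hr)): the current C2S2 preparation (816) against Su et al. (288), Sheng et al. (272) and our As fraction (276), respectively:

$$
\frac{816}{288} \approx 2.83,
\qquad
\frac{816}{272} = 3.0,
\qquad
\frac{816}{276} \approx 2.96
$$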

      6) It is claimed that "more than 2100 water molecules were detected in the C2S2 compressed model", and the water distribution is shown in Figure 3. Obtaining resolutions capable of visualizing waters with cryo-EM is still a significant challenge. Upon visual inspection of the map supplied, it appears that several of the waters that were built into the atomic model simply do not have supporting peaks in the coulomb potential map above the level of noise. While some of the modeled waters are certainly supported by the map, in my opinion, there are many waters that simply are not, or at best are questionable. What method or tool was originally used to build waters into the model, and how were these waters subsequently validated during structure refinement?

      We followed standard methods for water placement and refinement in the preparation of the model, in addition to manually curating the water structure. However, in light of the reviewer's comment, we undertook additional rounds of refinement and inspection of the water molecules in the model. We removed a few hundred water molecules, so the total number of water molecules is now around 1700. All the water molecules in the present model should be well supported at map values higher than 2.5 σ, and in our opinion the current water model should be regarded as conservative, underestimating the number of bound water molecules. This also led to some improvements in additional validation statistics of the model, which are listed in Table 1. The new model has been deposited in the PDB, and the new PDB validation report is included in our resubmission.

      7) The authors claim to identify several unique map densities during model building. One of these is a sodium ion close to the OEC, which is coordinated by D1-His337, several backbone carbonyls, and a water molecule. When looking closely at the cryo-EM map supplied, it appears that the coulomb potential map is quite weak for this sodium, and is only visible at quite low contour levels. In fact, the features for the coordinating water, and chloride ions located ~7-9A away are much stronger than the sodium. Do the authors have any explanation for why the cryo-EM map is significantly weaker for the sodium compared to the coordinating water or chloride ions in the same general vicinity? Similar to what they did for the other post-translational modifications, the authors should consider showing the actual cryo-EM map for the bound sodium in supplemental Figure 10 a,b.

      Our main support for placing a Na+ ion at this location stems from the analysis of Wang et al. Our maps show a density that is discernible at 4 σ, with an elongated shape suggesting the presence of multiple atoms/waters. Although in principle positive ions should have very strong densities in cryoEM maps due to their interactions with electrons, other factors such as occupancy, coordination and B-factor also play a role, making the distinction between water and sodium complicated and case-specific. The sodium peak is not observed in unsharpened maps (nor are most of the water molecules, which occupy conserved positions).

      We collected a few examples from comparable cases (cryo-EM maps in similar resolution ranges) where the presence of sodium ions is highly probable based on additional evidence. These map densities highlight the factors discussed above. In cases ‘a’ (dual oxidase 1 prepared in high-sodium conditions) and ‘b’ (human voltage-gated sodium channel), Na+ is observed in a highly coordinated state and, especially in ‘a’, shows the expected increase in density compared with water molecules. However, cases ‘d’ (human Na+/K+ P-type ATPase) and ‘e’ (voltage-gated sodium channel) appear very similar to the proposed Na+ assignment in PSII. We conclude that map density alone is not enough to distinguish between Na+ and water molecules, and we rely on the additional experiments described by Wang et al., which show increased PSII activity at elevated Na+ levels under basic conditions.

      8) The cryo-EM maps showing CP29-Ser84 phosphorylation and CP47-Cys218 sulfinylation are quite convincing. However, it is interesting that these modifications are only observed in the compact conformation, and not in the stretched conformation. Can the authors elaborate on whether or not they believe the compact and stretched conformations could be a result of these posttranslational modifications, or vice versa?

      This is an interesting suggestion. In our opinion it is less likely that the modifications themselves trigger the transition between the compact and stretched states, since it is not clear how these modifications would stabilize the compact versus the stretched state. It is equally likely that these modifications are somehow triggered by the structural change. We cannot be certain that these modifications are absent from the stretched orientation; they may remain unobserved due to resolution differences. The correlation between the states and the post-translational modifications should be verified before their possible roles in the transitions are discussed.

      9) Do the authors believe that PSII dimers in the solution can readily interconvert between compact and stretched conformations? Or is the relative ratio of these conformations fixed at the time of membrane solubilization with decyl-maltoside?

      We think it is more probable that the transition between these states occurs in the membrane phase, mainly because pigment loss and structural transitions in CP29 are more likely to occur in the membrane than in aqueous/micelle environments.

      10) The model proposed for divalent cation-mediated stacking of PSII dimers is compelling, and seems to be in agreement with previous investigations that observed a lack of stacked dimers in cryo-EM preparations lacking calcium/magnesium. However, my understanding from reading the methods section is that the observed lack of density between the stacked PSII dimers was inferred from maps obtained after multibody refinement. Based on the way the masks to define bodies were created for multibody refinement (Fig. 4A), the region between stacked dimers would be highly prone to map artifacts following multibody refinement. Have the authors looked closely at the interfacial region between stacked dimers following conventional 3D classification/refinement to ensure that there are indeed no features observed in the interfacial region even at low contour levels?

      We made several attempts to resolve density in the space between the stacked PSII dimers. These included focused classification with masks containing selected volumes from this region, and masks that include only one of the stacked PSII dimers to avoid signal subtraction in this region. None of these revealed any discernible features in this region. In addition, any stable binding of a bridging protein across the stacked dimer would probably be at least partially visible as additional density compared with the unstacked PSII. We searched for such features and found none.

    1. Author Response:

      Reviewer #1:

      This manuscript by Gabor Tamas' group defines features of ionotropic and metabotropic output from a specific cortical GABAergic cell type, the so-called neurogliaform cells (NGFCs), using electrophysiology, anatomy, calcium imaging and modelling. Experimental data suggest that NGFCs converge onto postsynaptic neurons with sublinear summation of ionotropic GABAA potentials and linear summation of metabotropic GABAB potentials. The modelling results suggest a preferential spatial distribution of GABA-B receptor-GIRK clusters on the dendritic spines of postsynaptic neurons. The data provide the first experimental quantitative analysis of the distinct integration mechanisms of GABA-A and GABA-B receptor activation by presynaptic NGFCs, and in particular give insight into the logic of volume transmission and the subcellular distribution of postsynaptic GABA-B receptors. The manuscript therefore provides novel and important information on the role of the GABAergic system within cortical microcircuits.

      We have made all changes humanly possible under the current circumstances, and we are open to further suggestions deemed necessary.

      Reviewer #2:

      The authors present a compelling study that aims to resolve the extent to which synaptic responses mediated by metabotropic GABA receptors (i.e. GABA-B receptors) summate. The authors address this question by evaluating the synaptic responses evoked by GABA released from cortical (L1) neurogliaform cells (NGFCs), an inhibitory neuron subtype associated with volume neurotransmission, onto Layer 2/3 pyramidal neurons. While response summation mediated by ionotropic receptors is well-described, metabotropic receptor response summation is not, thereby making the authors' exploration of the phenomenon novel and impactful. By carrying out a series of elegant and challenging experiments that are coupled with computational analyses, the authors conclude that summation of synaptic GABA-B responses is linear, unlike the sublinear summation observed with ionotropic, GABA-A receptor-mediated responses.

      The study is generally straightforward, even if the presentation is often dense. Three primary issues worth considering include:

      1) The rather strong conclusion that GABA-B responses linearly summate, despite evidence to the contrary presented in Figure 5C.

      2) Additional analyses of data presented in Figure 3 to support the contention that NGFCs co-activate.

      3) How the MCell model informs the mechanisms contributing to linear response summation.

      These and other issues are described further below. Despite these comments, this reviewer is generally enthusiastic about the study. Through a set of very challenging experiments and sophisticated modeling approaches, the authors provide important observations on both (1) NGFC-PC interactions, and (2) GABA-B receptor mediated synaptic response dynamics.

      The differences between the sublinear, ionotropic responses and the linear, metabotropic responses are small. Understandably, these experiments are difficult – indeed, a real tour de force – from which the authors are attempting to derive meaningful observations. Therefore, asking for more triple recordings seems unreasonable. That said, the authors may want to consider showing all control and gabazine recordings corresponding to these experiments in a supplemental figure. Also, why are sublinear GABA-B responses observed when driven by three or more action potentials (Figure 5C)? It is not clear why the authors do not address this observation considering that it seems inconsistent with the study's overall message. Finally, the final readout – GIRK channel activation – in the MCell model appears to summate (mostly) linearly across the first four action potentials. Is this true and, if so, is the result inconsistent with Figure 5C?

      GABAB responses elicited by three and four presynaptic NGFC action potentials were investigated to better understand the extremes of the NGFC-PC connection. Although our spatial model suggests that, in L1, one or two NGFCs could provide a GABAB response at a single volumetric point through their respective volume transmission, it remains important that, in a minority of cases, three or more NGFCs could converge their output. The experiments in Fig 5 not only suggest that HCN channel activation and GABA reuptake do not significantly influence the summation of metabotropic receptor-mediated responses, but also provide additional information about GABAB signaling driven by more than two NGFC outputs. Interestingly, in this experiment, summation of up to two action potentials showed linear integration very similar to that seen in the triplet recordings. This result suggests that temporal and spatial summation are identical when only a limited number of inputs arrive at the postsynaptic target cell. Our model shows similar summation for up to two consecutive GABA releases. However, while three or four consecutive GABA releases still produce linear summation in the model, our experiments show moderate sublinearity. One possible explanation for this inconsistency is vesicle depletion in NGFCs after multiple rapid releases of GABA, which was not taken into account in our model.
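      To make the linear-versus-sublinear comparison concrete, here is a minimal Python sketch (not the authors' analysis code) of one common way to quantify summation: divide the measured combined response by the arithmetic sum of the individual responses. For the triplet recordings the individual responses would come from each NGFC alone; for repetitive stimulation of one NGFC the sketch simply assumes that every spike contributes the single-spike amplitude. The 10% tolerance used to label a ratio as "linear" and all amplitudes are hypothetical.

```python
from typing import Sequence


def summation_ratio(individual: Sequence[float], combined: float) -> float:
    """Measured combined response divided by the linear (arithmetic sum) prediction.

    A ratio near 1.0 indicates linear summation; below 1.0, sublinear summation.
    Amplitudes may be negative (hyperpolarizing IPSPs) as long as they share a sign.
    """
    linear_prediction = sum(individual)
    if linear_prediction == 0:
        raise ValueError("linear prediction is zero; ratio undefined")
    return combined / linear_prediction


def classify(ratio: float, tolerance: float = 0.1) -> str:
    """Label a ratio using a hypothetical +/- tolerance around 1.0."""
    if ratio < 1.0 - tolerance:
        return "sublinear"
    if ratio > 1.0 + tolerance:
        return "supralinear"
    return "linear"


if __name__ == "__main__":
    # Spatial summation: two NGFCs converging on one PC (hypothetical amplitudes, mV).
    spatial = summation_ratio([-0.40, -0.55], combined=-0.93)
    print("two NGFCs:", round(spatial, 2), classify(spatial))

    # Temporal summation: n spikes from one NGFC, each assumed equal to the single response.
    single = -0.50
    bursts = {1: -0.50, 2: -1.00, 3: -1.20, 4: -1.40}
    for n, amp in bursts.items():
        r = summation_ratio([single] * n, amp)
        print(f"{n} spikes:", round(r, 2), classify(r))
```

      Dividing by the arithmetic sum, rather than by the largest individual response, keeps spatial (two cells) and temporal (one cell, several spikes) summation on the same scale.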

      Presumably, the motivation for Figure 3 is that it provides physiological context for when NGFCs might be coactive, and thereby for when downstream PC responses might summate. This is a nice, technically impressive addition to the study. However, a relevant quantification seems to be missing from the figure. The authors nicely show that hind limb stimulation evokes responses in the majority of NGFCs, but how many of these neurons are co-active, and what are their spatial relationships? Figure 3D appears to begin to address this point, but it is not clear whether this plot comes from a single animal or from multiple animals. Also, such a plot would be most relevant for the study if it showed only alpha-actinin 2-positive cells. In short, can one conclude that nearby, presumptive NGFCs co-activate, and is this conclusion derived from multiple animals?

      The aim of Fig. 3D was to show that the active cells, presumably NGFCs, are located close to each other. The figure comes from a single animal. We agree with the reviewer and have therefore replaced the scatter plot in Fig. 3D with one that provides information about the molecular profiles of the active and inactive cells. We further analyzed our in vivo data and the spatial localization of the monitored interneurons (see Author response image 3). The results are from 4 different animals; in these experiments, numerous L1 interneurons were active during the sensory stimulus, as shown in the scatter plot. We calculated the shortest distance between all active cells and all ɑ-actinin2+ cells that were active in the experiments. The data suggest that, for identified active ɑ-actinin2+ cells, the interneuron somas were on average 182.69 ± 60.54 or 305.135 ± 34.324 μm apart. Data from Fig. 2D indicate that NGFC axonal arbors reach on average ~200-250 μm. Taken together, these two datasets suggest that the spatial arrangement of NGFCs could, in theory, allow neighboring NGFCs to interact directly at the same spatial point.
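      The shortest-distance analysis described above can be sketched as follows, assuming that each active cell is represented by 2D soma coordinates in μm. The coordinates and the helper names `nearest_neighbor_distances` and `fraction_within_reach` are hypothetical illustrations, not the authors' pipeline. For each cell the sketch finds the distance to the closest other cell; the mean of these distances, and the fraction of cells whose nearest neighbour lies within a given reach, can then be compared with the ~200-250 μm NGFC axonal extent.

```python
import math
from statistics import mean


def nearest_neighbor_distances(points):
    """For each soma (x, y) in μm, return the Euclidean distance to the closest other soma."""
    if len(points) < 2:
        raise ValueError("need at least two cells")
    distances = []
    for i, p in enumerate(points):
        # Compare against every other cell; skip the cell itself.
        distances.append(min(math.dist(p, q) for j, q in enumerate(points) if j != i))
    return distances


def fraction_within_reach(distances, reach_um):
    """Fraction of cells whose nearest neighbour lies within reach_um."""
    return sum(d <= reach_um for d in distances) / len(distances)


if __name__ == "__main__":
    # Hypothetical soma coordinates (μm) of active alpha-actinin2+ cells in one field of view.
    somas = [(0.0, 0.0), (150.0, 0.0), (150.0, 200.0), (600.0, 50.0)]
    d = nearest_neighbor_distances(somas)
    print("nearest-neighbour distances:", [round(x, 1) for x in d])
    print("mean:", round(mean(d), 1), "μm")
    print("fraction within 250 μm:", fraction_within_reach(d, 250.0))
```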

      The inclusion of the diffusion-based model (MCell) is commendable and enhances the study. Also, the description of GABA-B receptor/GIRK channel activation is highly quantitative, a strength of the study. However, a general summary/synthesis of the observations would be helpful. Moreover, relating the simulation results back to the original motivation for generating the MCell model would be very helpful (i.e. the authors asked whether "linear summation was potentially a result of the locally constrained GABAB receptor - GIRK channel interaction when several presynaptic inputs converge"). Do the model results answer this question? It seems as if performing "experiments" on the model wherein local constraints are manipulated would begin to address this question. Why not use the model to provide some data – albeit theoretical – that begins to address their question?

      We have reformulated the problem addressed in this Results section. We acknowledge in the Discussion that our model has several limitations; consequently, we restricted its application to a limited set of quantitative comparisons paired with our experimental dataset or directly related to pioneering studies on GABAB efficacy on spines versus shafts. We believe that a proper answer to the reviewer’s suggestion would merit a separate, dedicated study with an extended set of parameters and a more elaborate model.

      In sum, the authors present an important study that synthesizes many experimental (in vitro and in vivo) and computational approaches. Moreover, the authors address the important question of how synaptic responses mediated by metabotropic receptors summate. Additional insights are gleaned from the function of neurogliaform cells. Altogether, the authors should be congratulated for a sophisticated and important study.

      Reviewer #3:

      The authors of this manuscript combine electrophysiological recordings, anatomical reconstructions and simulations to characterize synapses between neurogliaform interneurons (NGFCs) and pyramidal cells in somatosensory cortex. The main novel finding is a difference in summation of GABAA versus GABAB receptor-mediated IPSPs, with a linear summation of metabotropic IPSPs in contrast to the expected sublinear summation of ionotropic GABAA IPSPs. The authors also provide a number of structural and functional details about the parameters of GABAergic transmission from NGFCs to support a simulation suggesting that sublinear summation of GABAB IPSPs results from recruitment of dendritic shaft GABAB receptors that are efficiently coupled to GIRK channels.

      I appreciate the topic and the quality of the approach, but there are underlying assumptions that leave room to question some conclusions. I also have a general concern that the authors have not experimentally addressed mechanisms underlying the linear summation of GABAB IPSPs, reducing the significance of this most interesting finding.

      1) The main novel result of broad interest is supported by nice triple recording data showing linear summation of GABAB IPSPs (Figure 4), but I was surprised this result was not explored in more depth.

      We chose to study GABAB-GABAB interactions through the lens of neurogliaform cells and explored how neurogliaform cells as a population might give rise to the summation properties studied with triple recordings. This was a deliberate choice that admittedly neglects other possible sources of GABAB-GABAB interactions, which may take place during high-frequency coactivation of homogeneous or heterogeneous populations of interneurons innervating the same postsynaptic cell. We agree with the reviewer that the summation of GABAB IPSPs is an important topic and that an in-depth mechanistic understanding requires separate studies.

      2) To assess the effective radius of NGFC volume transmission, the authors apply quantal analysis to determine the number of functional release sites to compare with structural analysis of presynaptic boutons at various distances from PC dendrites. This is a powerful approach for analyzing the structure-function relationship of conventional synapses but I am concerned about the robustness of the results (used in subsequent simulations) when applied here because it is unclear whether volume transmission satisfies the assumptions required for quantal analysis. For example, if volume transmission is similar to spillover transmission in that it involves pooling of neurotransmitter between release sites, then the quantal amplitude may not be independent of release probability. Many relevant issues are mentioned in the discussion but some relevant assumptions about QA are not justified.

      Indeed, pooling of neurotransmitter between release sites may affect quantal amplitude; we therefore examined quantal amplitude under low release probability conditions, using 0.7-1.5 mM [Ca²⁺]o, to detect postsynaptic uniquantal events initiated by neurogliaform cell activation (Author response image 7). With this approach, we measured quantal current amplitudes similar to those obtained with the BQA method, with no significant difference (4.46 ± 0.83 pA, n = 4, P = 0.8, Mann-Whitney test).
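      The rationale for lowering [Ca²⁺]o can be illustrated with a simple binomial release model. This model is an assumption made here for illustration; it is not the Bayesian quantal analysis (BQA) used in the study, and its independence assumption is exactly what transmitter pooling could violate. With N independent sites each releasing with probability p, the fraction of successful trials that contain more than one quantum shrinks as p falls, so the mean amplitude of successes approaches the quantal amplitude q. The site number and probabilities below are hypothetical, and q is normalized to 1.

```python
from math import comb


def prob_k_quanta(n_sites: int, p: float, k: int) -> float:
    """Binomial probability that exactly k of n_sites independent sites release."""
    return comb(n_sites, k) * p**k * (1.0 - p) ** (n_sites - k)


def multiquantal_fraction(n_sites: int, p: float) -> float:
    """Fraction of successful trials (k >= 1) that contain two or more quanta."""
    p_success = 1.0 - prob_k_quanta(n_sites, p, 0)
    if p_success <= 0:
        raise ValueError("release probability must be > 0")
    return (p_success - prob_k_quanta(n_sites, p, 1)) / p_success


def mean_success_amplitude(n_sites: int, p: float, q: float) -> float:
    """Mean amplitude conditioned on a success: N*p*q / P(k >= 1)."""
    p_success = 1.0 - (1.0 - p) ** n_sites
    if p_success <= 0:
        raise ValueError("release probability must be > 0")
    return n_sites * p * q / p_success


if __name__ == "__main__":
    n_sites, q = 5, 1.0  # hypothetical number of release sites; q normalized to 1
    for p in (0.5, 0.2, 0.05):
        print(f"p={p}: multiquantal fraction={multiquantal_fraction(n_sites, p):.3f}, "
              f"mean success amplitude={mean_success_amplitude(n_sites, p, q):.3f} q")
```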

      3) The authors might re-think the lack of GABA transporters in the model since the presence and characteristics of GATs will have a large effect on the spread of GABA in the extracellular space.

      We agree that the presence of GATs could effectively shape GABA exposure, e.g. (Scimemi 2014). During the development of the model, we considered different possibilities and solutions for creating the model’s environment. To our knowledge, there is no detailed electron microscopic study that provides ultrastructural measurements of the structural elements around NGFC release sites and postsynaptic pyramidal cell dendrites in layer 1 while preserving the extracellular space. Moreover, quantitative information is scarce about the exact localization and density of GATs along the membrane surface of glial processes around confirmed NGFC release sites. We felt that developing a functional environment containing GABA transporters without such information would be speculative. Furthermore, during the development of the model it became clear that incorporating thousands of differentially located GABA transporters would massively increase the processing time of single simulations, because each interaction between GATs and GABA molecules would have to be monitored and the diffusion of GABA molecules in the extracellular space would have to be computed even when the molecules are far from the postsynaptic dendritic site and do not interact with it.

      As an admittedly simple and constrained alternative, we set a decay half-life for the released GABA molecules. This approach allowed us to mimic a GABA exposure time of 20-200 ms, based on experimental data (Karayannis et al., 2010). In the model, the GABA exposure time was 114.87 ± 2.1 ms, with a decay time constant of 11.52 ± 0.14 ms. After ~200 ms, all released GABA molecules had disappeared from the simulation environment.

      A detailed treatment of extracellular diffusion was beyond the scope of our model; we were interested in investigating how the subcellular localization of receptors and channels determines summation properties.
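      For a first-order decay with time constant τ, the half-life is t½ = τ ln 2 and the fraction of molecules remaining after time t is exp(-t/τ). The Python sketch below treats the 11.52 ms decay time constant quoted above as a pure first-order removal constant. That is our simplifying assumption: the reported constant may also reflect diffusion, and this is not the authors' MCell configuration. It shows why essentially no GABA is left after ~200 ms. A small stochastic version gives each molecule an exponentially distributed lifetime, loosely mimicking per-molecule decay in a particle-based simulator; the molecule counts and seeds are arbitrary.

```python
import math
import random

# Decay time constant quoted in the text, treated here as a first-order removal constant (assumption).
TAU_MS = 11.52


def half_life(tau_ms: float) -> float:
    """Half-life of a first-order process: t_half = tau * ln 2."""
    return tau_ms * math.log(2)


def fraction_remaining(t_ms: float, tau_ms: float) -> float:
    """Deterministic fraction of molecules left at time t: exp(-t / tau)."""
    return math.exp(-t_ms / tau_ms)


def simulate_removal(n_molecules: int, tau_ms: float, t_end_ms: float, seed: int = 0) -> int:
    """Count molecules whose exponentially distributed lifetime exceeds t_end_ms."""
    rng = random.Random(seed)
    lifetimes = [rng.expovariate(1.0 / tau_ms) for _ in range(n_molecules)]
    return sum(life > t_end_ms for life in lifetimes)


if __name__ == "__main__":
    print(f"half-life: {half_life(TAU_MS):.2f} ms")
    print(f"fraction left at 200 ms: {fraction_remaining(200.0, TAU_MS):.1e}")
    print("molecules left at 200 ms (stochastic, n=3000):",
          simulate_removal(3000, TAU_MS, 200.0))
```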

      4) I'm not convinced that the repetitive stimulation protocol of a single presynaptic cell shown (Figure 5) is relevant for understanding summation of converging inputs (Figure 4), particularly in light of the strong use-dependent depression of GABA release from NGFCs. It is also likely that shunting inhibition contributes to sublinear summation to a greater extent during repetitive stimulation than summation from presynaptic cells that may target different dendritic domains. The authors claim that HCN channels do not affect integration of GABAB IPSPs but one would not expect HCN channel activation from the small hyperpolarization from a relatively depolarized holding potential.

      Use-dependent synaptic depression of NGFC-induced postsynaptic responses was nicely documented by Karayannis and coworkers (2010), although they investigated the GABAA component of the responses and found that the depression is caused by desensitization of postsynaptic GABAA receptors. We are not aware of published experiments on the short-term plasticity of GABAB responses. In our experiments represented in Fig 5, we found linear summation of GABAB responses for up to two action potentials and sublinear summation for 3 and 6 action potentials. In fact, our results show no detectable synaptic depression in response to paired pulses: the amplitude of the voltage response to two pulses was double that to a single pulse, which means that the paired-pulse ratio is around 1. To verify this result, we repeated our dual recording measurements with one, two, three and four spikes elicited in the presynaptic neurogliaform cell (Author response image 6). Measuring both the amplitude and the overall charge of GABAB responses, we again found a linear relationship between the one- and two-spike protocols.

      Author response image 6. Integration of GABAB receptor-mediated synaptic currents. (A) Representative recording of neurogliaform synaptic inhibition in a voltage-clamped pyramidal cell. Bursts of up to four action potentials were elicited in NGFCs at 100 Hz in the presence of 1 μM gabazine and 10 μM NBQX. (B) Summary of normalized IPSC peak amplitudes (left) and charge (right). (C) Pharmacological separation of the neurogliaform-initiated inhibitory current.
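      The statement that a doubled response implies a paired-pulse ratio around 1 follows from a simple subtraction. The Python sketch below assumes that the second response is estimated as the two-spike response minus the single-spike response, i.e. that the first response's time course is unaffected by the second spike. This is a common convention but an assumption here, not necessarily the authors' exact procedure. Amplitudes are hypothetical.

```python
def paired_pulse_ratio(single: float, paired: float) -> float:
    """Estimate PPR as (paired - single) / single.

    `single` is the peak response to one presynaptic spike and `paired` the peak
    response to two spikes. The second response is inferred by subtraction, which
    assumes the first response is unchanged by the second spike.
    """
    if single == 0:
        raise ValueError("single-spike response must be non-zero")
    return (paired - single) / single


if __name__ == "__main__":
    # Hypothetical GABAB IPSP amplitudes (mV); hyperpolarizing responses are negative.
    print(paired_pulse_ratio(single=-0.8, paired=-1.6))  # doubled response -> ratio of 1
    print(paired_pulse_ratio(single=-0.8, paired=-1.2))  # depressed second response -> ratio below 1
```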

    1. Author Response:

      Reviewer #1 (Public Review):

      Overall, the work is an impressive analysis of an understudied cell type in human MS and represents an important finding. The paper is well presented and the figures are very clear. However, the manuscript is descriptive and, although this is not a problem by itself, the limited depth of the CyTOF panel (only 37 markers) leaves the reader without a clear idea of what these cells could be doing.

      Single-cell RNA-seq and other ways to interrogate potential mechanisms and function would be particularly helpful here, but are perhaps beyond the scope of the paper.

      We thank the reviewer for this nice comment. We fully agree that a next informative step would be the investigation of the function and mechanisms of the NK cell populations in MS pathology. At this moment, that is indeed beyond the scope of the current manuscript. We do believe that our findings can guide future studies to explore potential mechanisms of NK cells in more depth.

      At a minimum, additional immunohistochemistry and smFISH or in situ hybridization to validate key findings (using the markers identified by CyTOF) and to characterize the spatial relationships of NK cells with other border and brain cells would be informative.

      We appreciate this suggestion and have performed several immunohistochemical analyses to study the spatial relationship of NK cells with other immune and brain cells in the MS brain (Essential Revisions Fig. 1). We stained the same cohort described in the manuscript for CD45, NKp46, GrB and Iba1, as well as for CD45, NKp46, GrB and GFAP, to study the interaction of NK cells with microglia/macrophages and astrocytes, respectively, and with CD45+ immune cells in general. In MS lesions, we detected a small but similar percentage of putative CD56bright NK cells (CD45+ NKp46+ GrB- cells) interacting with CD45+ Iba1- cells and with CD45+ Iba1+ cells (Essential Revisions Figure 1a-b). Because of astrogliosis, astrocyte processes densely populate MS lesions, so we cannot infer whether the contacts between NK cells and astrocytes are functional. Furthermore, the absolute number of NK cells in control brains is low, so we can only obtain reliable data from MS brains. As a result, we are unable to compare the observed interactions in MS lesions with a control condition. Of note, CD56bright NK cells are potent cytokine producers, and their potential regulatory functions may not be limited to contact-dependent interactions.

      Essential Revisions Fig. 1. Cellular interactions of Granzyme B- NK cells. (a) Representative immunohistochemical staining of Granzyme B- NK cells stained for CD45 (green) and NKp46 (magenta) and negative for Granzyme B (cyan), together with microglia stained for Iba1 (red). Scale bar = 10 µm. (b) Pie chart showing the percentage of CD45+ NKp46+ Granzyme B- cells interacting with CD45+ Iba1+ and CD45+ Iba1- cells in MS lesions. (c) Representative immunohistochemical staining of NK cells stained for CD45 (green) and NKp46 (magenta) and negative for Granzyme B (cyan), together with astrocytes stained for GFAP (red). Scale bar = 10 µm.

      A major weakness of the study is that it is underpowered, and thus it is not clear how robust or representative these findings are in MS, given the heterogeneity of the disease, potential differences in sex and age, and the lack of healthy controls (AD samples are labelled as controls).

      We thank the reviewer for their comment. First, we would like to address the presumed lack of healthy controls. In this study, we included two ‘control’ groups: one consisting of non-neurological controls (“NNC”), free of any neurological disease, and the other consisting of neurological controls (“NC”), including patients with dementia and Alzheimer’s disease. We acknowledge that this terminology may confuse the reader; we have therefore renamed the “NC” group of patients with dementia to “Dementia” and the “NNC” group of donors without neurological disease to “Controls”.

      Second, while our sample size is rather small, it is comparable to other studies that use fresh post-mortem brain tissue (Böttcher et al., 2020). The use of this unique post-mortem brain tissue from human donors is severely limited by the number of well-characterized samples available, their demographics and clinical background. To overcome the underpowered design and possible effects of confounders such as sex and age, we validated our main finding by multiplex immunohistochemistry in a separate cohort. This included 5 controls (2 females, 3 males; f:m ratio of 0.667) and 7 MS cases (3 females, 4 males; f:m ratio of 0.75), with a similar female/male ratio and matched age (Wilcoxon rank sum test with continuity correction, p-value = 0.41). We have now included the characteristics of the validation cohort in the manuscript as well.

      “Finally, to confirm that CD56bright NK cells accumulate in periventricular brain regions in MS donors, we used multiplex immunohistochemistry in an independent cohort (Table 1), wherein MS and control groups were age-matched (Wilcoxon rank sum test with continuity correction, p-value = 0.41) and had a similar female:male ratio (0.667 in controls and 0.75 in MS).”

      Böttcher C, van der Poel M, Fernández-Zapata C, Schlickeiser S, Leman JKH, Hsiao CC, Mizee MR, Adelia, Vincenten MCJ, Kunkel D, Huitinga I, Hamann J, Priller J (2020) Single-cell mass cytometry reveals complex myeloid cell composition in active lesions of progressive multiple sclerosis. Acta neuropathologica communications, 8(1), 1-18

      It is also important to show the NK cells are actually in the parenchyma and interacting with other cells (e.g., microglia) of the lesion. If the authors have this tissue and antibodies to do that, this would add to the study. Moreover, the details on samples and controls should be more clearly communicated in the text and legends as well as the caveats and limitations of the study in the Discussion.

      The location of NK cells within the brain parenchyma is an important determinant of their function within the CNS. We therefore included a basement membrane marker (collagen IV) in our multiplex IHC panel in order to exclude cells within the vessel lumen. As this was not clearly communicated, we have adjusted the sentence in the Multiplex immunohistochemistry subsection of the Methods (from “Cells within the lumen of vessels from the choroid plexus sections were excluded manually” to “Cells within the lumen of vessels were excluded manually with the aid of collagen IV staining.”). The additional IHC experiments performed to explore the interactions of NK cells with other brain-resident cells are shown in Essential Revisions Fig. 1. We thank the reviewer for pointing out the confusing nomenclature and have adjusted the labels of the three main groups throughout the manuscript as follows: Control (previously NNC), Dementia (previously NC) and MS (unchanged). We have also expanded the limitations of this study in the Discussion.

      “Our study has two main limitations. First, the scarcity of fresh human tissue prevented us from obtaining sex- and age-matched groups with large sample sizes for the CyTOF analysis. To overcome the underpowered design and possible effects of confounders, we validated our main finding by multiplex immunohistochemistry in a separate cohort with a similar age and female/male ratio. Second, there is a strong contribution of blood-derived immune cells in the choroid plexus, which precluded a clear distinction between circulating and stromal immune cells. This may have prevented the detection of choroid plexus-specific changes in the stroma, such as an accumulation of CD8+ T cells in the choroid plexus of MS donors, previously described by our group using immunohistochemistry [47]. In addition, the high proportion of granulocytes in the CP detected by our CyTOF analysis likely originates from the circulation [47,63]. Conversely, the scarcity of B cells, despite the high vascularisation, is in line with previous reports [47,63], and the detection of rare ASCs in the choroid plexus but not in the blood supports their tissue specificity [63].”

      Reviewer #2 (Public Review):

      The data are extensive, valuable, convincing, and entirely descriptive (as studies using human post-mortem material must be, of necessity). What emerges is a detailed account of NK cells in specific regions of the MS brain (although here the authors slightly overplay how little is known about NK cells in MS). The study provides a very comprehensive resource. The authors speculate on what their data might mean in terms of disease dynamics in a reasonable and informed way, but much of what is concluded is inference not backed up by experimental studies that would allow this to be more than a resource paper.

      We thank the reviewer for his/her compliments and agree that in this manuscript we can only speculate on the role of NK cells and on their migration to, and proliferation within, the brain. Only future research can resolve these questions. We have addressed these concerns accordingly in the Discussion and have removed conclusions and far-fetched speculations that are not backed up by our own data.

      Reviewer #3 (Public Review):

      The authors introduce their work in the context of the prevailing uncertainties about the pathogenesis of multiple sclerosis (MS) and, in particular, seem to reference the initiation of immune lesions in early MS. However, the work itself addresses end-stage MS situations, which is quite possibly an entirely different landscape altogether, and may not be informative about MS initiation.

      We want to thank the reviewer for pointing out this misleading part of the text. We agree that our study does not provide any information on the initial stages of MS, and have therefore adjusted this part of the introduction to avoid confusion. “Brain regions around the ventricles are hotspots for MS lesions [8,21,39,52], but underlying mechanisms are poorly understood [41]. Since the majority of periventricular MS lesions occur around a central vessel [1,57], it has been suggested that vascular topography may influence MS pathology [33].”

      As a textual point, the manuscript makes far more speculations about possible cell trafficking between compartments than is justified by a cross-sectional study.

      We appreciate this concern and have therefore toned down our speculations in the Results and Discussion sections.

      That said, the work itself is a carefully done descriptive characterisation of the leucocyte landscape found in the periventricular septum, choroid plexus (and peripheral blood) post-mortem from cases of multiple sclerosis (MS), non-MS neurological disease (dementia), and non-neurological controls (8-12 each). The material is rare, the post-mortem delays are quite short, the cell lineage characterisation is fairly extensive and some of the data are well supported by immunohistochemistry.

      We thank the reviewer for these compliments.

    1. Author Response:

      Reviewer #4 (Public Review):

      In this work, Tee et al. study the implications of Heparan Sulfate (HS) binding mutations observed on the Enterovirus A71 (EV-A71) capsid. HS-binding mutations are observed for several virus infections and are often presumed to be a cell culture adaptation. However, in the case of EV-A71, the presence of HS-binding mutations in clinical samples and the contradictory findings in animal studies have made the clinical relevance of HS-binding a subject of debate. Therefore, to better understand the role of HS-binding in EV-A71, the authors use a mouse-adapted EV-A71 variant (MP4) and compare it to a cell-adapted strong HS-binder (MP4-97R/167G). Using these two variants, the authors show that the strong HS-binder does not require acidification for uncoating and genome release. Furthermore, it is demonstrated that the capsid stability of the HS-binding variant is compromised, resulting in pH-independent uncoating. Overall, this study provides new insights demonstrating that seemingly beneficial mutations increasing viral replication may be counterbalanced by other unintended consequences.

      Strengths:

      The thoroughness of the experiments performed to demonstrate that the HS-binding phenotype results in pH-independent entry and capsid destabilisation is worth highlighting. In this regard, the authors have explored viral entry using a range of approaches involving lysosomotropic drugs, viral binding assays, and neutral red-labelled viruses coupled with diverse techniques such as FISH, RNAscope, and transient expression of constitutively active molecules to inhibit parts of the viral cycle. In my opinion, this is necessary to rule out other downstream effects of the lysosomotropic drugs and to confirm the role of the HS-binding mutation in the entry phase. The use of in silico analysis coupled with negative staining electron microscopy and environmental challenge assays is notable. Finally, the demonstration of some of the work using a human-relevant strain is commendable.

      We appreciate the reviewer’s recognition of the significance of our study and the valuable advice.

      Weaknesses:

      A major weakness in this study is the focus on using a mouse-adapted EV-A71 strain (MP4). In the introduction, it is argued that HS-binding mutations are controversial due to their occurrence in cell culture. However, due to host limitations, mice are not the natural hosts for EV-A71 and thus, the same argument can be made for a mouse-adapted strain. It is not clear how different this strain is from circulating EV-A71 strains and the relevance of these findings to the human situation is questionable. This is particularly made evident in the discussion where it is highlighted that HS-binding variants (VP1-145G/Q mutants) have been associated with severe neurological cases while the same variants show attenuated phenotypes in mice and monkeys. This contrast between clinical data and animal studies should be highlighted in the introduction, rather than later in the discussion, as currently the in vivo animal studies are presented as the optimal situation and may lead to misconstrued conclusions from the results.

      As requested by the reviewer, we included new experiments performed with a clinical strain isolated from an immunosuppressed patient (Cordey et al., 2012). We compared the HCQ sensitivity of this human strain with and without the VP1 L97R and E167G mutations and confirmed that it showed the same differential sensitivity to HCQ as the MP4 variant. This result is presented as a new supplementary figure (Figure 6-figure supplement 1) and is described in the Results section of the revised manuscript (Page 7, lines 251).

      Page 7, lines 251: To determine if our observations are applicable to human strains, we examined the sensitivity of a closely related clinical strain. This strain was isolated from the respiratory tract of an immunosuppressed patient with a disseminated EV-A71 infection27. Additionally, we tested a strong HS-binding derivative that harbors the same VP1-L97R and E167G mutations as our MP4 double mutant. Notably, this human clinical strain shares 98.3% amino acid similarity with the MP4 variant used in this study and exhibits similar HS-binding phenotypes28. As shown in Figure 6-figure supplement 1, the original human strain was inhibited by HCQ, whereas the double mutant exhibited insensitivity to the drug.

      We also added the comment about discrepancy between clinical data and animal studies in the introduction as requested (page 2, lines 69-76): However, epidemiological surveillance of human EV-A71 infections19-21 and experimental evidence from 2D human fetal intestinal models22, human airway organoids23 and air-liquid interface cultures24 suggest that HS binding may enhance viral replication and virulence in humans. In addition, recent research has shown that EV-A71 can be released and transmitted via cellular extrusions25 or exosomes26, potentially preventing viral trapping of HS-binding strains in the circulation. Further studies are required to evaluate the true impact of HS-binding mutations on the spread and virulence of EV-A71 in both animal models and humans.

      An important consideration is that the results are based primarily on image analysis. The inclusion of RT-qPCR and/or plaque assays as supplementary data will help strengthen the findings.

      We have performed RT-qPCR to confirm the immunostaining data and included them in the supplementary data (Figure 1-figure supplement 1E). Reference to these data is made in the result section [Page 4, lines 114-116: These results were confirmed by viral load quantification with real-time RT-PCR (Figure 1-figure supplement 1E).]

      Moreover, there are suggestions of an intermediate binder having a different phenotype. As this intermediate binder is the clinical phenotype, data on the entry of this intermediate binder will be valuable.

      While we agree with the reviewer that the single mutant is an intermediate binder and exhibits a clinical phenotype, we chose to work with variants that display clear phenotypes, selecting MP4 and the double mutant, as the latter is fully attenuated in both immunocompetent and immunosuppressed mice (Weng et al., 2023). Additionally, in an experiment using HCQ, we observed an intermediate effect with the single mutant, which further supported our decision to proceed with MP4 and the double mutant for all experiments. The supporting data are shown in Author response image 1, which we are sharing exclusively with the reviewer.

      Author response image 1. Differential sensitivity of MP4, MP4-97R and MP4-97R167G to lysosomotropic drugs.

      Another weakness in the study is the lack of contextualization of the results to current EV-A71 literature. For instance, SCARB2 is referred to as the internalization receptor but a recent study has shown that SCARB2 is not required for internalization (https://doi.org/10.1128%2Fjvi.02042-21). The findings from this study are consistent with the localization of SCARB2 in the lysosomal membranes. Furthermore, the same study has highlighted host sulfation as a key factor in EV-A71 entry. Post-translational sulfation introduces negatively charged residues on host proteins including HS and SCARB2. This increases the binding of HS-binding strains to these proteins. In this regard, the reduced infectivity upon soluble SCARB2 treatment may simply be due to enhanced binding rather than capsid opening as suggested in the results. Therefore, additional experiments (e.g. nSEM following soluble SCARB2 treatment) must be performed to support the conclusion of capsid opening, due to inherent instability, upon SCARB2 binding.

      We apologize for not citing this relevant literature, which rules out a role for SCARB2 in viral attachment. We have now included these references in the revised version of the manuscript (Page 2, lines 54-56: “Since SCARB2 is mostly localized on endosomal and lysosomal membrane and sparsely on plasma membrane3,5, it seems to play only a minor role in EV-A71 cell attachment6,7.”).

      We thank the reviewer for mentioning the possibility that the sulfation of SCARB2 may enhance its binding to the mutated virus compared to the wild-type virus, potentially explaining the selective competitive inhibition of this variant by soluble SCARB2 produced in mammalian cells. To investigate this hypothesis, we performed nsEM imaging of the double mutant incubated with soluble SCARB2 and we observed an increase in the proportion of empty capsids in the presence of soluble SCARB2 (4% versus 0.7%), supporting our original findings that the inactivation is indeed associated with capsid opening. The results are included in the revised manuscript in Figure 5-figure supplement 4 and described on Page 7, lines 243-245: “However, the double mutant exhibited a ~5-fold increase in empty capsid percentage after treatment with sSCARB2 (Figure 5-figure supplement 4), consistent with the functional data above.”

      In addition to the above, other existing literature on EV-A71 pathogenesis using organoids contradicts some of the explanations of differential phenotype in clinical observations versus mice models. In the introduction, it is suggested that reduced neurovirulence of HS-binding strains is due to binding to the vascular endothelia. However, the correlation of clinical severity to viremia (https://doi.org/10.1186/1471-2334-14-417) and the association of HS-binding mutants to clinical disease counteract this suggestion. Similarly, viral infection in human organoids with EV-A71 results in as low as 0.4% of the cells being infected (https://doi.org/10.1038/s41564-023-01339-5). In this case, if viral binding to (ubiquitously expressed) HS results in viral trapping then the HS-binding mutants should show lowered infectivity in organoid models rather than the observed higher infectivity (https://doi.org/10.3389/fmicb.2023.1045587, https://doi.org/10.1038/s41426-018-0077-2). Finally, EV-A71 release has also been shown to occur in exosomes (https://doi.org/10.1093%2Finfdis%2Fjiaa174) which effectively provides a protective lipid membrane. These recent findings must be incorporated into the article and will help better contextualize their findings.

      We appreciate the reviewer's thoughtful comments. We do not believe that the correlation between clinical severity and viremia contradicts the viral trapping hypothesis. For strains that do not bind to HS, the absence of viral trapping could indeed lead to higher viral concentrations in the bloodstream, potentially increasing neurovirulence. However, we agree with the reviewer that other observations in humans, along with experimental data from more relevant models such as organoids, challenge the trapping hypothesis. We are grateful for the suggested citations and have incorporated these references in the introduction, where we discuss this point in more detail.

      Page 2, lines 69-76: “However, epidemiological surveillance of human EV-A71 infections19-21 and experimental evidence from 2D human fetal intestinal models22, human airway organoids23 and air-liquid interface cultures24 suggest that HS binding may enhance viral replication and virulence in humans. In addition, recent research has shown that EV-A71 can be released and transmitted via cellular extrusions25 or exosomes26, potentially preventing viral trapping of HS-binding strains in the circulation. Further studies are required to evaluate the true impact of HS-binding mutations on the spread and virulence of EV-A71 in both animal models and humans.”

      Overall, the authors present new findings with convincing methodology. The manuscript can be improved in the contextualization of the findings and highlighting the weakness in translating these findings to resolve the debate surrounding the relevance of HS-binding phenotype. The inclusion of additional experiments and data recommended to the authors will also help strengthen the manuscript.

    1. Author Response:

      Reviewer #1 (Public Review):

      This manuscript investigates the gene regulatory mechanisms that are involved in the development and evolution of motor neurons, utilizing cross-species comparison of RNA-sequencing and ATAC-sequencing data from little skate, chick and mouse. The authors suggest that both conserved and divergent mechanisms contribute to motor neuron specification in each species. They also claim that more complex regulatory mechanisms have evolved in tetrapods to accommodate sophisticated motor behaviors. While this is strongly suggested by the authors' ATAC-seq data, some additional validation would be required to thoroughly support this claim.

      Strengths of the manuscript:

      1) The manuscript provides a valuable resource to the field by generating an assembly of the little skate genome, containing precise gene annotations that can now be utilized to perform gene expression and epigenetic analyses. The authors take advantage of this novel resource to identify novel gene expression programs and regulatory modules in little skate motor neurons.

      2) Cross-species RNA-seq and ATAC-seq data comparisons are combined in a powerful approach to identify novel mechanisms that control motor neuron development and evolution.

      Weaknesses:

      1) It is surprising that the analysis of RNA-seq datasets between mouse, chick, and little skate only identified 5 genes that are common between the 3 species, especially given the authors' previous work identifying highly conserved molecular programs between little skate and mouse motor neurons, including core transcription factors (Isl1, Hb9, Lhx3), Hox genes and cholinergic transmission genes. This raises some questions about the robustness of the sequencing data and whether the genes identified represent the full transcriptome of these motor neurons.

      To address reviewer #1’s questions, we have generated RNA sequencing data from mouse forelimb MNs and re-analyzed the RNA-seq data using only the homologous MN populations (Figure 3) across species. As a result, we found that 1038 genes, including many known MN marker genes, are commonly expressed in MNs of the different species. In the Results section, we have added the following:

      “The evolution of genetic programs in MNs was investigated unbiasedly by comparing highly expressed genes in pec-MNs (percentile expression > 70) of little skate with the ones from MNs of mouse and chick, two well-studied tetrapod species. In order to compare gene expression with homologous cell types from each species, we performed RNA sequencing on forelimb MNs of mouse embryos at embryonic day 13.5 (e13.5) and wing level MNs of chick embryos at Hamburger-Hamilton (HH) stage 26–27…”

      We have also compared our re-analysis with previous results in Figure 2–figure supplement 1, shown above. Most of the previously identified fin MN genes (21/24) are highly expressed in pec-MNs (percentile > 70), consistent with the previous in situ experiments. In the Results, we have added the following:

      “Although the total number of DEGs differs from the previous data (592 vs. 135 pec-MN DEGs), which might be caused by different statistical analyses with a different reference genome, the previous RNA-seq data based on de novo assembly and annotation using zebrafish were mostly recapitulated in our DEG analysis based on our new skate genome (21 out of 24 previous fin MN marker genes have expression levels ranked above the 70th percentile in pec-MNs; Figure 2‒figure supplement 1).”

      2) Based on an analysis of binding motifs in their ATAC-seq data, the authors suggest that the greater number of putative binding sites in mouse MNs allows for a higher complexity of regulation and specialization of putative motor pools. This could certainly be true in theory but needs to be further validated. The authors show FoxP1 as an example, which seems to be more heavily regulated in the mouse, but there is no evidence that the FoxP1 expression profile differs between mouse and skate. It is suggested in Fig. 5 that FoxP1 might be differentially regulated by Snai1 in mouse and skate, but the expression of Snai1 in MNs in either species is not shown.

      We have added further discussion and data about differential expression of Foxp1 in mouse and little skate in Figure 5–figure supplement 16 and have discussed as follows:

      “Foxp1, the major limb/fin MN determinant, appears to be differentially regulated in tetrapods and little skate. Although Foxp1 is expressed in and required for the specification of all limb MNs in tetrapods, Foxp1 is downregulated in Pea3-positive MN pools during maturation in mice (Catela et al., 2016; Dasen et al., 2008). In addition, preganglionic motor column neurons (PGC MNs) in the thoracic spinal cord of mouse and chick express Foxp1 at half the level of limb MNs. Although PGC neurons have not yet been identified in little skate, we tested the expression level of Foxp1 using a previously characterized tetrapod PGC marker, pSmad. We observed that Foxp1 is not expressed in MNs that express pSmad (Figure 5‒figure supplement 3). Since there is currently no known marker for PGC MNs in little skate, our conclusion should be taken with caution.”

      As for Snai1, in the revision we performed a motif enrichment analysis with an unbiased gene list, in which Snai1 did not appear. However, RNA in situ hybridization for Snai1 (Figure 5–figure supplement 3) showed that Snai1 is expressed in MNs of both mouse and little skate, but not in chick, consistent with a previous report (Cheung et al., 2005). To examine the function of Snai1 in the regulation of Foxp1 expression, we ectopically expressed Snai1 in the chick spinal cord by in ovo electroporation. However, we did not detect any changes in Foxp1. Instead, we observed an increase in the number of neurons and abnormal MN exits from the spinal cord, reminiscent of a previous observation (Zander et al., 2014). Although we did not detect any changes in Foxp1 expression, we cannot rule out the possibility that Snai1 regulates Foxp1 in mouse and little skate, which may require a gene knockout experiment to test. Because Snai1 binding sites were not enriched in the new gene sets that we analyzed in the revision, we have not discussed Snai1 further in the text.

      3) In their discussion section the authors state that they found both conserved and divergent molecular markers across multiple species but they do not validate the expression of novel markers in either category beyond RNA-seq, for example by in situ or antibody staining.

      We have added RNA in situ hybridization results in Figure 3C and Figure 3–figure supplement 1 and 2. Most of the genes were expressed in tissues in accordance with the sequencing results (6 out of 9 common MN genes; 4 out of 6 mouse-specific genes; 5 out of 7 skate-specific genes). Specifically, Uchl1, Slc5a7, Alcam, and Serinc1 are expressed in MNs of all three species; Coch, Ppp1rc, Ctxn1, and Clmp are expressed in MNs of mouse but not in MNs of other species; Eya1, Etv5, Dnmbp, and Spint1 are expressed in MNs of skate but not in MNs of other species. In the Results section, we have summarized the results as follows:

      “These results were validated by performing RNA in situ hybridization in tissue sections on a subset of species-specific genes …”

    1. Author Response

      Reviewer #2 (Public Review):

      Regulation of NAD and its intermediary metabolites is of critical importance in axon degeneration and neurodegenerative disease. Mounting evidence supports a scenario in which low NAD and high NMN trigger axon degeneration by competitive allosteric inhibition/activation of SARM1. Strategies to increase NAD levels and/or lower NMN levels provide neuroprotection in a variety of contexts. NAD metabolism is a partially conserved process; however, there are key differences in pathway routes and dynamics between the model organisms used for NAD research (yeast, worm, fly, zebrafish, mouse/mammalian systems). Drosophila is a key model organism for axon degeneration research based on its ease of use and range of available genetic tools; in addition, the effector of axon degeneration, SARM1, was first identified in the fly. As Drosophila has some key differences in the NAD synthesis pathways compared with mammalian systems, it is important to test and develop tools that enable exploration of these pathways in the fly. Llobet Rosell and colleagues have developed clear and demonstrable tools in Drosophila for exploring NAD-related axon degenerative pathways by modulating the use of NMN via the addition of NMN-consuming and NMN-generating enzymes. They utilize Drosophila genetics to adequately support the claims made in the manuscript. Importantly, the authors clearly demonstrate that consuming NMN through an alternate route to NaMN provides neuroprotection and that the neuroprotective components of low NMN are upstream of SARM1. These should be useful tools for neuroscientists using Drosophila for neurodegenerative research in the future.

      Strengths:

      • Clear demonstration that low NMN provides neuroprotection using novel, stable, enzymatic depletion of NMN (to NaMN).

      • Development of a novel Drosophila tool (NMN-D transgenics) to explore NMN metabolism in vivo, including a stabilized version to permit chronic NMN depletion.

      • Metabolomic profiles across the pathway to show all pathway changes (not just isolated NMN or NAD assays).

      • Neurodegenerative assays that include both histological outcomes (axon degeneration) and circuitry/functional outcomes. Data from both series of experiments all support each other.

      • Assessment of other known potent axon degenerative genes via genetics in combination with the tools developed.

      • Staging of the molecular processes by strategic ablation of the inhibitory ARM domain on SARM1 (dSarm deltaARM). These experiments suggest that low NAD AND high NMN (i.e. the ratio between the two) is the critical factor that drives axon degeneration. Once NAD is low, axon degeneration cannot be rescued by further lowering of NMN. The dSarm delta-ARM and dnmnat sgRNA experiments support the hypothesis that (high) NMN triggers, but does not execute, axon degeneration.

      We appreciate the reviewer's recognition of the quality of our research.

      Weaknesses:

      • The authors use murine NAMPT (mNAMPT) to increase NMN. The degeneration assays support the hypotheses made, yet mNAMPT doesn't actually increase NMN. Thus it is unclear in this setting whether mNAMPT promotes axon degeneration by an NMN-related mechanism or through another route. It is also unclear as to why the murine form was chosen versus a human or other orthologues, or changing the metabolism of the intrinsic pathway (NR and NRK).

      Why mNAMPT:

      We decided to use mouse NAMPT (mNAMPT) because it was readily available from Giuseppe Orsomando (Amici et al., 2017), and because we did not have access to human NAMPT (hNAMPT).

      We agree that under physiological conditions, the expression of mNAMPT does not change NMN levels. However, we argue that after injury, once dNmnat is degraded, the additional NMN synthesis provided by mNAMPT expression (in addition to dNrk) leads to faster NMN accumulation. This is supported by the observation that NMNAT2 is more labile than NAMPT in mammals (Gilley and Coleman, 2010; Stefano et al., 2015).

      • The authors use metabolic profiling to look at the individual metabolites during axon degenerative events and treatments; however, it is unclear whether any of the corresponding proteins or genes change as a consequence. This is likely not important for understanding the findings but might be helpful in explaining the mNAMPT data.

      We agree that it is worth testing whether altering the metabolic flux induces changes at the mRNA or protein level. To do this, we first measured the relative expression levels of axon death and NAD+ synthesis genes (Figure 2 – figure supplement 1B). We then measured potential changes upon mNAMPT expression (Figure 4 – figure supplement 1). While Gal4-driven expression increased the relative abundance of mNAMPT transcripts from 30 to 12,000, the changes observed in the other genes were not notable. Of note, compared to Actin–Gal4, dnrk is 2-fold lower in both the UAS-mNAMPT and Actin > mNAMPT backgrounds (control and experiment, respectively). Overall, therefore, there appears to be no change in the mRNAs of either axon death or NAD+ synthesis genes.

      In the results, we changed the text accordingly:

      "We then tested the effect of mNAMPT on the NAD+ metabolic flux in vivo. Surprisingly, NAM, NMN, and NAD+ levels remained unchanged under physiological conditions (Figure 4C). However, we noticed 3-fold higher NR and a moderate but significant elevation of ADPR and cADPR levels upon mNAMPT overexpression (Figure 4C). We also asked whether mNAMPT impacts on NAD+ homeostasis thereby altering the expression of axon death or NAD+ synthesis genes. Besides the expected significant increase in the Gal4-mediated expression of mNAMPT, we did not observe any notable changes at the mRNA level (Figure 4 – figure supplement 1)."

      • The authors repeatedly introduce a novel PncC antibody. However, no details on this, its generation, or its testing are found within the manuscript as presented. The antibody detects with several bands. The authors speculate that this could be a degradation product but nothing substantial is shown.

      In Materials and methods, we added a new section:

      "PncC antibody generation. Rabbit anti-PncC antibodies were generated by Lubioscience under a proprietary protocol. The immunogen used was purified from Escherichia coli, strain K12, corresponding to the full protein sequence of NMN-D. The amino acid sequence is the following: MTDSELMQLSEQVGQALKARGATVTTAESCTGGWVAKVITDIAGSSAWFERGFVTYSNEAKAQMIGVREETLAQHGAVSEPVVVEMAIGALKAARADYAVSISGIAGPDGGSEEKPVGVWFAFATARGEGITRRECFSGDRDAVRRQATAYALQTLWQQFLQNT."

      We also updated the results referencing it.

      "We found that both wild-type and enzymatically dead NMN-D enzymes are equally expressed in S2 cells, as detected by newly generated PncC antibodies (Materials & Methods, Figure 1–figure supplement 2). Notably, we observed two immunoreactivities per lane, with the lower band being a potential degradation product."

      In addition, we now provide evidence for why we believe that the upper band is NMN-D, while the lower one is a degradation product. In the figure attached below, the samples in the first five lanes were denatured at 70 °C, while the samples in the last two lanes were denatured at 95 °C (10 min each). The resulting Western blot shows that at 70 °C there is more nonspecific background but no lower degradation product, whereas at 95 °C the background is drastically reduced but a lower degradation product appears. NMN-D is indicated by an asterisk. We feel it is important to show these data here in the rebuttal, but we believe that including them in the manuscript would confuse readers.

      • Olfactory receptor neuron degeneration assays are shown in Fig. 1, but no quantitative data are presented to support the images.

      We agree that a quantification would support our observation. However, it is difficult to precisely quantify individual axons in the ORN injury assay, for two main reasons:

      1. Severed axons are often bundled, thus the exact number cannot be scored.

      2. Because the cell body is removed, mCD8::GFP is no longer synthesized, so the axonal GFP intensity decreases over time. This adds another level of difficulty.

      Nevertheless, we added numbers to each example in Figure 1D and E, where we quantified the % of brains in which severed preserved axons were observed, similar to Figure 2 in MacDonald et al. (2006).

      In the results section, we changed the text as indicated below:

      "We extended the ORN injury assay and found preservation at 10, 30, and 50 dpa (Figure 1E). While quantifying the precise number of axons is technically not feasible, severed preserved axons were observed in all 10, 30, and 50 dpa brains, albeit fewer at later time points (MacDonald et al., 2006). Thus, high levels of NMN-D confer robust protection of severed axons for multiple neuron types for the entire lifespan of Drosophila."

      In the Figure 1 legend, we changed the text accordingly:

      "D Low NMN results in severed axons of olfactory receptor neurons that remain morphologically preserved at 7 dpa. Examples of control and 7 dpa (arrows, site of unilateral ablation). Lower right, % of brains with severed preserved axon fibers. E Low NMN results in severed axons that remain morphologically preserved for 50 days. Representative pictures of 10, 30, and 50 dpa, from a total of 10 brains imaged for each condition (arrows, site of unilateral ablation). Lower right, % of brains with severed preserved axon fibers."

    1. Author Response

      eLife assessment:

      This paper conducts human and rodent experiments of non-invasive diffusion MRI estimates of axon diameter with the aim to establish whether these estimates provide biologically specific markers of axonal degeneration in MS. It will be of interest to researchers developing quantitative MRI methods and scientists studying neurodegeneration. The experiments provide evidence for the sensitivity of these markers, but do not directly validate axon diameter and do not reflect common pathological mechanisms across rodents and humans.

      We thank the Editor for the appreciation of our work. Thanks to the addition of an extensive electron microscopy paradigm, we now include a direct validation of axonal damage and expand on the common pathological mechanisms across the two species. The new results are detailed in the manuscript and summarized in Fig. 3.

      Reviewer #1 (Public Review):

      1.1 My primary concern relates to how meaningful the human-rodent comparisons are, and whether these comparisons really advance our understanding of AxCaliber estimates in MS. I applaud the aim to conduct "matched" experiments in both rodent models and human disease. It is a strength that the experiments are aligned with respect to the MRI measurements (although there are some caveats to this mentioned below). But beyond that, the overlap is not what one might hope for: the pathology would seem to be very distinct in humans and rodents, and the histological validation is not specific to what the MRI measurements claim to estimate. To summarize the main findings: (i) in a rat model of general axonal degeneration, axon calibre estimates correlate with neurofilaments; (ii) in MS in humans, axon calibre estimates correlate with demyelinating lesions. This gives a picture of AxCaliber estimates correlating with neuropathology, but is this something that has not already been established in the literature? If the aim is to validate AxCaliber, then there is a logic in using a rodent model that isolates alterations to axonal radius, but what then does this add to the existing literature in that space? If the aim is to study MS (for which AxCaliber results have been previously reported in Huang et al.), then why not use a rodent model of MS?

      We thank the reviewer for their very insightful comments. Indeed, multiple sclerosis (MS) is a chronic neuroinflammatory and neurodegenerative disease of unknown etiology. An enormous effort has been made to obtain animal models that simulate the pathogenesis of this disease. However, while several models exist recapitulating distinct aspects of the disease (mostly related to demyelination), MS fundamentally remains a disease that only affects humans. This does not mean that EAE or lysolecithin models do not provide information on specific aspects and are therefore valuable. In fact, we believe that trying to replicate the pathological mechanisms of this disease in an animal model goes beyond the scope of the present work. In this work, our intention is to validate a biomarker of axonal damage preclinically, and for this, we use a model of axonal degeneration. We do not claim that this model should be valid to capture the complex clinical and pathological manifestation of MS, but we do think that it is a necessary step to ensure MRI sensitivity to axonal pathology. Why necessary? Because all the available (very limited) MRI literature which provides some form of validation: i) only focuses on healthy tissue, and ii) has an n of 1. Our preclinical paradigm gives conclusive evidence that the MRI axonal diameter proxy detects axonal damage as an increase in the mean diameter. This is now detailed in the discussion.

      After this necessary preclinical validation, we then apply the same framework to a human disease like MS that, among other manifestations, is believed to also cause axonal pathology. The improvements with respect to the one published work about axonal diameter in MS are: i) the whole brain analysis, which allowed us to characterize the extent of these early alterations outside the demyelinated lesions; and ii) the larger sample size, which allowed us to uncover an association with disease duration, strengthening our hypothesis about increased axonal diameter being a marker of early disease (new Fig. 5).

      Regarding the nonspecificity of histological validation, we thank the reviewer for this insightful comment, which triggered an additional analysis that we believe has added further value to the paper. Using electron microscopy, we found that in our model of neurodegeneration, axonal damage is indeed reflected as an increase in axon diameter (new Fig. 3). These recent findings strongly support the validation of our noninvasive diffusion MRI estimates of axon diameter alterations as an early-stage hallmark of normal-appearing tissue in MS.

      Coming back to the comparison between pathology in humans and in rodents, the EM data also support our choice of preclinical model, showing axonal swelling, the same phenomenon reported and characterized in recent postmortem histological data in the normal-appearing white matter of MS patients (Luchicchi et al., Ann Neurol 2021) and in lesions (Fisher et al. Ann Neurol 2007).

      All in all, we are confident that the new data support the validity of this translational approach and shed new light on the degenerative aspect of MS.

      Changes in the manuscript

      • Discussion, pag.12: It is important to stress that the aim of this work is not to propose a new animal model of MS, a disease that only affects humans, but rather to validate axonal damage detection (independently from the pathology that has induced it) through noninvasive MRI and apply the framework to characterize axonal pathology in MS.

      1.2 I appreciate that both rodent and patient studies are time-intensive, major endeavors. Nevertheless, the number of subjects is very low in both rodent (n=9) and human (MS=10, control=6) studies. At the very least, this should be more openly acknowledged. But I'm concerned that this is a major weakness of the paper. Related to this, I find it hard to tell how carefully multiple comparison correction was performed throughout. It seems reasonably clear for the TBSS analyses, but then other analyses were performed in ROIs. Are these multiple comparisons corrected as well? Similarly, in Methods, I am confused by the statement that: "post hoc t tests corrected for multiple comparisons whenever a significant effect was detected". What does this mean?

      We thank the reviewer for this comment. We agree that a small sample size was a weakness of the previous version of the paper, and therefore, in the new version, we have substantially increased the n for both animal and human experiments (from n=9 to 19 in animals, from 16 to 21 in humans). We removed the ROI analysis in the new version, and thus the confusing statement, and clarified the strategy for multiple comparisons.

      Changes in the manuscript

      • Data analysis, pag. 18: Lesion masks were excluded from the statistical analysis, and multiple comparisons across clusters were controlled for by using threshold-free cluster enhancement.

      1.3 While I do not think the text is in any sense deliberately misleading, I think the authors would do well to either tone down their claims or consider more carefully the implications of the text in many places. Some that stuck out for me are:

      Throughout, language in the paper (e.g., "Paired t tests were used to assess differences in the axonal diameter") presumes that the AxCaliber estimates specifically reflect axon diameter. I think the jury is out over whether this is true, particularly for measurements conducted with limited hardware specs. At the very least, I would encourage the authors to refer to these measurements throughout as "estimates" of axon diameter.

      Thank you for this clarification. We have indeed changed the notation, and now consistently refer to the estimates of axon diameter through MRI as the “MRI axonal diameter proxy”.

      1.4 The authors suggest that their results provide "new tools for patient stratification" based on differences in lesion type, but it isn't clear what new information these markers would confer given that the lesions are differentiated based on T1w hypo/hyperintensities. In other words, these lesions are by definition already differentiable from a much simpler MRI marker.

      Thank you for this insightful comment. The reviewer is right, and following the general reviewers’ assessment we have decided to not include the lesion analysis in the new version of the manuscript.

      1.5 The authors note in the Discussion that: "sensitive to early stages of axonal degeneration, even before alterations in the myelin sheet are detected". Whether intentional or not, the implication in the context of this study is that this would hold for MS (that these markers would detect axonal degeneration preceding demyelination). While there is some discussion of alterations to axonal diameter in MS, the authors do not discuss whether these are the same mechanisms thought to occur in the IBO intervention used here.

      Thank you for this comment. Indeed, the scope of the paper is not to assess whether or not axonal swelling precedes myelin alterations, so we agree with the reviewer that this sentence might be misleading and have removed it from the text. While we do not claim that ibotenic acid injections are able to replicate the complex clinical and pathological manifestation of MS (and we now make this clear in the revised manuscript, see comment 1), the electron microscopy paradigm indicates the presence of axonal swelling in the damaged fimbria, which is indeed the same pathological manifestation found in MS post-mortem data (see e.g. Fisher et al. Ann Neurol 2007).

      1.6 In the Discussion, the authors note the lack of evidence for a relationship with disability or disease duration, but nevertheless, go on to interpret the "trends" they do observe. I would advise strongly against this: the authors acknowledge that their numbers are low, so I would avoid the temptation to speculate here.

      The reviewer is 100% correct. We should have refrained from speculating. In the new version of the paper, however, thanks to the larger human cohort, we were able to find significant associations with disease duration in voxelwise analysis of the white matter skeleton in standard space and in the whole white matter in single subject space (new Figure 5).

      1.7 In the Discussion, the authors state that "the use of neurofilaments has also been well validated in MS". Well validated for what? MS is a complex disease with a broad range of pathology, so this statement could be read to mean "neurofilaments are known to be altered in MS". However, in the context of this paragraph, the implication would seem to be that neurofilaments are a well-established proxy for axonal diameter. Is that the implication, and if so, what general evidence is there for this?

      We thank the reviewer for this insightful comment. Indeed, altered neurofilaments are not conclusive evidence of increased axonal diameter. In this context, the addition of electron microscopy data in the new manuscript version supports the claim.

      Reviewer #2 (Public Review):

      Diffusion MRI is sensitive to brain microstructure, and it has been used to assess the integrity of white matter for nearly 3 decades. Its main limitation is its limited specificity, which makes it difficult to link changes in diffusion parameters to a given pathological substrate. Recently, methods based on diffusion MRI that enable the non-invasive estimation of axonal diameter have become available. This paper aims at validating one such method using an experimental model of neurodegeneration. The authors found a significant correlation between axonal diameter estimated by MRI and a histological marker of neurodegeneration. Although this is of great interest, as it demonstrates that this method is sensitive to neurodegeneration, a direct validation would require a measurement of axonal diameter using electron or confocal microscopy, rather than a correlation with a measure of axonal degeneration not directly related to axonal diameter. So, although these data are compelling, they do not prove that the increase in axonal diameter suggested by diffusion MRI corresponds to actual axonal swelling. The authors also apply the same method to compare the white matter of patients with multiple sclerosis (MS) and healthy controls, showing widespread increases in axonal diameter in the patients. These data are compelling but, again, not conclusive. Other factors such as gliosis could bias the MRI measurement and lead to an apparent increase in axonal diameter.

      We would like to thank the reviewer for the positive assessment of our work and for the valuable suggestion. We are confident that the new version of the manuscript, by including an extensive validation based on electron microscopy, has addressed the reviewer’s criticisms.

      Reviewer #3 (Public Review):

      3.1 In this paper, Toschi et al. performed dMRI to estimate axon diameter in the brain in vivo and demonstrated that multi-compartmental modeling (AxCaliber) is sensitive to microstructural axonal damage in rats and to axon caliber increase in demyelinating lesions in MS patients, suggesting that axon diameter mapping provides a potential biomarker to bridge the gap between medical imaging contrasts and biological microstructure. In particular, the authors injected ibotenic acid (IBO) and saline in the left and right rat hippocampus, respectively, and compared in vivo estimated axon diameter and ex vivo neurofilament staining in the left and right fimbria. The axon size estimate was larger in the fimbria on the IBO injection side, where the neurofilament intensity is higher. A correlation between axon size estimate and neurofilament intensity was observed on both injection sides. Further, higher axon diameter estimates were observed in the normal appearing white matter (NAWM) of MS patients, compared with healthy subjects. The axon size estimate increased in lesions that were hypointense on T1-weighted contrast, but not in isointense lesions. Through the comparison of dMRI-estimated axon size and histology-based fluorescence intensity, the authors indirectly validated the sensitivity of axon diameter mapping to tissue microstructure in the rat brain, and further explored axon size changes in the brains of MS patients. However, the dMRI protocol and biophysical modeling in this study were not fully optimized to maximize sensitivity to axon size, and the dMRI-estimated axon size (4.4-5.4 micron) was much larger than values reported in previous histological studies (0.5-3 micron) [Barazany et al., Brain 2009]. Finally, although the modified AxCaliber model incorporated two fiber bundles in different directions, the fiber dispersion in each bundle was not considered (cf. fiber dispersion of ~20-30 degrees in the corpus callosum), potentially leading to overestimated axon diameter.

      We thank the reviewer for their appreciation of our work, which we believe is substantially improved in this revised version through the inclusion of an electron microscopy paradigm. Below is our point-by-point response to the specific points raised.

      3.2 The conclusions in this study are supported by experimental results. However, the dMRI protocol and biophysical model could be further optimized and validated: 1. To estimate axon diameters of ~1 micron in vivo using dMRI, strong diffusion weighting (b-value) should be applied to maximize the signal decay due to intra-axonal restricted diffusion and to minimize the signal contribution of extra-cellular hindered diffusion. However, the authors applied a maximal b-value of only 4000 s/mm2, much smaller than the values of ~15,000-20,000 s/mm2 used in previous studies [Assaf et al., MRM 2008; Huang et al., BSAF 2020, 225:1277]. The use of low diffusion weighting in this study leads to a lower bound of ~4-6 micron for accurate diameter estimation, the so-called resolution limit [Nilsson et al., NMR Biomed 2017, 30:e3711]. In other words, the estimated axon diameter is potentially overestimated and related to the imaging protocol and image quality, confounding the biological interpretation.

      We thank the reviewer for this insightful comment. While the resolution limit is indeed a concern, the chosen b-value was a compromise between sensitivity to small structures and SNR, as indicated by recent animal (Crater et al., 2022) and human (Jensen et al., 2016; McKinnon et al., 2017; Moss et al., 2019) work pointing to 3000-4000 s/mm2 as the b-value range for which the intra-axonal water signal is dominant. In addition, a paper from the laboratory that first developed the AxCaliber method recently came out (Gast et al., 2023, DOI: 10.1007/s12021-023-09630-w), demonstrating that an MRI protocol with a maximum b-value between 3000 and 4000 s/mm2 (and even lower) is sufficient to capture, in vivo and in humans, various well-known aspects of axonal morphometry (e.g., the variation of axon diameter across the corpus callosum) as well as other, less explored aspects (e.g., axon diameter-based separation of the superior longitudinal fasciculus into segments). The same paper contains resources and further bibliography supporting the conclusion that the contribution of intra-axonal water to restricted diffusion signals dominates other factors (see Online Resource 1, section A of that paper). To test this recent evidence from a neurobiology perspective, we include in the supplementary material a subset of experiments in animals with a lower maximum b-value (2500 s/mm2, Fig. S1), in which we detect the same increase in the MRI axonal diameter proxy in the injected hemisphere compared to control.

      We would like to add that, while extremely valuable and informative, simulation studies such as the excellent study by Veraart et al., 2020, are inevitably valid only under certain assumptions. Among the critical ones are i) neglecting non-axonal cells such as glia, ii) assuming that the bulk diffusivity of water in cerebral tissue is the same as that of free water, and iii) assuming impermeable barriers. All these assumptions are expected to affect the estimated resolution limit, to an extent that is difficult to quantify but likely substantial.

      For this reason, we believe that our approach, which is 100% focused on neurobiology and measurements performed in real tissue, can offer a different perspective and fuel the ongoing debate on the feasibility of axonal diameter measurement. We acknowledge the value of the reviewer’s comment and discuss the b-value issue in the Discussion (see also comment 1.8).
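      Because the discussion above turns on b-values and diffusion times, the sketch below relates them using the standard Stejskal–Tanner expression for ideal rectangular pulsed-gradient spin-echo (PGSE) gradients, b = γ²G²δ²(Δ − δ/3). It is illustrative only: the gradient amplitudes and timings in the example are hypothetical and are not the acquisition parameters of this study, and real sequences with ramped or oscillating gradients need a modified formula. The function names are our own.

```python
import math

# Gyromagnetic ratio of the 1H nucleus (rad s^-1 T^-1).
GAMMA_1H = 2.675221874e8


def pgse_b_value(g_t_per_m: float, delta_s: float, big_delta_s: float) -> float:
    """Stejskal-Tanner b-value for ideal rectangular PGSE gradients, in s/mm^2.

    g_t_per_m: gradient amplitude G (T/m)
    delta_s: gradient pulse duration delta (s)
    big_delta_s: separation between pulse onsets Delta (s)
    """
    if big_delta_s < delta_s:
        raise ValueError("Delta must be at least as long as delta")
    b_si = (GAMMA_1H * g_t_per_m * delta_s) ** 2 * (big_delta_s - delta_s / 3.0)  # s/m^2
    return b_si * 1e-6  # 1 s/m^2 = 1e-6 s/mm^2


def gradient_for_b(b_s_per_mm2: float, delta_s: float, big_delta_s: float) -> float:
    """Gradient amplitude (T/m) needed to reach a target b-value at fixed timing."""
    b_si = b_s_per_mm2 * 1e6
    return math.sqrt(b_si / ((GAMMA_1H * delta_s) ** 2 * (big_delta_s - delta_s / 3.0)))


if __name__ == "__main__":
    # Hypothetical timings: hold b = 4000 s/mm^2 while lengthening the diffusion time.
    for big_delta in (0.015, 0.030, 0.060):
        g = gradient_for_b(4000.0, 0.005, big_delta)
        print(f"Delta = {big_delta * 1e3:.0f} ms -> G = {g * 1e3:.0f} mT/m")
```

      At fixed δ, keeping b constant while lengthening Δ requires a weaker gradient; conversely, a much higher b at fixed timing demands stronger gradients or longer pulses, and longer pulses lengthen the echo time and cost SNR.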

      Changes in the manuscript

      • Discussion, page 12:
      Despite some inevitable minor differences due to different brain sizes and magnet features, the human protocol was built to match the main characteristics of the preclinical diffusion sequence, such as the b-value and diffusion time range. The chosen b-value was a compromise between sensitivity to small structures and the signal-to-noise ratio (SNR), as indicated by recent animal (Crater et al., 2022) and human (Gast et al., 2023; Jensen et al., 2016; McKinnon et al., 2017; Moss et al., 2019) work pointing to 4000 s/mm2 as the b-value for which the intra-axonal water signal is dominant. However, following recent work supporting the sensitivity of diffusion-weighted MRI to axonal diameter even at lower b-values (Gast et al., 2023), we tested a protocol with a lower b-value in a subset of animals, with the aim of facilitating future clinical AxCaliber studies. We found no qualitative differences in the outcome (the MRI axonal diameter proxy was increased following fimbria damage). Further work, and perhaps more realistic simulations that consider real cell composition and morphology, are needed to clarify this issue.

      3.3 In this study, the positive correlation of dMRI-estimated axon size and neurofilament fluorescence intensity is indeed an encouraging result, and yet this validation is indirect since it relies on the positive correlation between neurofilament intensity and axon diameter in histology.

      The reviewer correctly points out a severe limitation of the previous manuscript version, which is now addressed by including an extensive electron microscopy evaluation, recapitulated in new Fig. 3.

      3.4 The authors did not consider fiber dispersion in the proposed dMRI model. This can lead to overestimated axon diameter, even in highly aligned WM such as the corpus callosum, with ~20-30 degree dispersion in histology [Ronen et al., BSAF 2014, 219:1773; Leergaard et al., PLoS One 2010, 5(1), e8595] and MRI [Dhital et al., NeuroImage 2019, 189, 543; Novikov et al., NeuroImage 2018, 174:518].

      The reviewer correctly points out an important characteristic of white matter microstructure, namely fiber dispersion. However, we would like to point out that the use of a second fiber population is expected to mitigate this effect by absorbing some of the axonal directional dispersion in single-fiber areas. To support this, we quantified dispersion as the angle between the two main fiber orientations captured by the AxCaliber fit, as shown in Author response image 1 for two representative subjects (one control, upper row, and one MS patient, lower row; the “dispersion” maps are masked by a white matter probability mask and superimposed on a T2w image). Indeed, the angle between the two main fibers in the corpus callosum is around 20 degrees or lower, compatible with the literature cited by the reviewer, and higher in other white matter areas known to be characterized by fiber crossing and dispersion.

      Author response image 1.

      Angle in radians between the two main fiber orientations captured by the AxCaliber fit, shown for two representative subjects (one control, upper row, and one MS patient, lower row). The dispersion maps are masked by a white matter probability mask (P>=0.95) and superimposed on a T2-weighted image.
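      A minimal sketch of the angle computation described in the reply and caption, assuming each voxel provides two non-zero orientation vectors from the two-fiber AxCaliber fit. A fiber orientation has no sign (v and −v describe the same axis), so the absolute value of the dot product is used and the angle lies in [0, π/2]. The function names and the (..., 3) array layout are hypothetical, not the authors' code; the 0.95 threshold mirrors the mask in the caption.

```python
import numpy as np


def inter_fiber_angle(v1, v2):
    """Angle (radians) between two fiber axes; sign-invariant, in [0, pi/2].

    Works on single vectors of shape (3,) or voxel arrays of shape (..., 3).
    Assumes no zero-length vectors.
    """
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    cos_theta = np.abs(np.sum(v1 * v2, axis=-1)) / (
        np.linalg.norm(v1, axis=-1) * np.linalg.norm(v2, axis=-1)
    )
    # Clip guards against rounding pushing |cos| slightly above 1.
    return np.arccos(np.clip(cos_theta, 0.0, 1.0))


def masked_angle_map(v1, v2, wm_probability, threshold=0.95):
    """Voxelwise angle map, set to NaN outside the white matter probability mask."""
    angles = inter_fiber_angle(v1, v2)
    return np.where(np.asarray(wm_probability) >= threshold, angles, np.nan)
```

      For reference, two orientations 20 degrees apart give about 0.35 rad, so the corpus callosum values quoted in the reply correspond to roughly 0.35 rad or less on the radian scale of the maps.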

    1. Author Response

      Reviewer #1 (Public Review):

      This paper shows that a principled, interpretable model of auditory stimulus classification can not only capture the behavioural data on which the model was trained but also predict, with some accuracy, behaviour for manipulated stimuli. This is a real achievement and gives an opportunity to use the model to probe potential underlying mechanisms. There are two main weaknesses. Firstly, the task is very simple: distinguishing between just two classes of stimuli. Both model and animals may, for example, be using shortcuts to solve the task (this is somewhat suggested by Figure 8, which shows that the guinea pig and model can both handle time-reversed stimuli).

      The task structure is indeed simple. In the context of the categorization tasks typically used in animal experiments, however, we would argue that our task is at the higher end of stimulus complexity. Most animal experiments with auditory categories employ a category boundary along a single stimulus parameter (for example, tone frequency or the modulation frequency of AM noise). Only a few recent studies (for example, Yin et al., 2020; Town et al., 2018) have explored animal behavior with “non-compact” stimulus categories. Thus, we consider our task a significant step towards more naturalistic tasks.

      We were also faced with the practical issue of the trainability of guinea pigs (GPs). Prior to this study, guinea pigs had been trained using classical conditioning and aversive reinforcement on tone frequency detection (e.g., Heffner et al., 1971; Edeline et al., 1993). More recently, competitive training paradigms have been developed for appetitive conditioning, using a single “footstep” sound as a target stimulus and manipulated sounds as non-target stimuli (Ojima and Horikawa, 2016). But as GPs had never been trained on more complex tasks before our study, we started with a conservative one vs. one categorization task. We mention this in the Discussion section of the revised manuscript (page 27, line 665).

      To determine whether these results hold for more complex tasks as well, after receiving the reviews of the original manuscript, we trained two GPs (originally trained and tested on the wheeks vs. whines task) further on a wheeks vs. many (whines, purrs, chuts) task. As before, we tested these GPs with new exemplars and verified that they generalized. In the figure below, the average performance of the two GPs on the regular (training) stimuli and novel (generalization) stimuli is shown as gray bars, and individual animal performance is shown as colored discs. The GPs achieved high performance for the novel stimuli, demonstrating generalization. We also implemented a 4-way winner-take-all (WTA) stage for a wheeks vs. many model and verified that the model also generalized to new stimuli.

      For frequency-shifted calls, these two GPs performed better for wheeks vs. many compared to the average for wheeks vs. whines shown in the main manuscript. The 4-way WTA model closely tracked GP behavioral trends.

      The psychometric curves for wheeks vs. many categorization in noise (different SNRs) did not differ substantially from the wheeks vs. whines task.

      We focused our one vs. many training on the two conditions that showed the greatest modulation in the one vs. one tasks. However, these preliminary results suggest that the one vs. one results presented in the manuscript are likely to extend to more complex classification tasks as well. We chose not to include these new data in the revised manuscript because we performed these experiments on only 2 animals, which were previously trained on a wheeks vs. whines task. In future studies, we plan to directly train animals on one vs. many tasks.
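      The replies mention a 4-way winner-take-all (WTA) stage and a 10% error set in that stage. One plausible reading is sketched below purely as an illustration: each call class receives an evidence score (assumed here to be something like a summed weight of that class's detected features), the class with the largest score wins, the response is Go only when the winner is the target class, and the 10% error is modeled by flipping the response with probability 0.1. The scoring rule, the flipping mechanism, the function name and the example scores are all hypothetical, not the authors' implementation.

```python
import random
from typing import Dict, Optional


def wta_response(
    evidence: Dict[str, float],
    target: str,
    error_rate: float = 0.1,
    rng: Optional[random.Random] = None,
) -> str:
    """Return "go" or "nogo" from a winner-take-all over class evidence scores."""
    rng = rng if rng is not None else random.Random()
    winner = max(evidence, key=evidence.get)  # class with the largest evidence
    response = "go" if winner == target else "nogo"
    # Hypothetical error model: flip the response with probability error_rate.
    if rng.random() < error_rate:
        response = "nogo" if response == "go" else "go"
    return response


if __name__ == "__main__":
    scores = {"wheek": 2.4, "whine": 0.7, "purr": 1.1, "chut": 0.3}  # made-up values
    rng = random.Random(0)
    responses = [wta_response(scores, "wheek", rng=rng) for _ in range(1000)]
    print("fraction go:", responses.count("go") / len(responses))
```

      With four classes, the same function covers the wheeks vs. many case: any non-wheek winner maps to No-Go.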

      Secondly, the predictions of the model do not appear to be quite as strong as the abstract and text suggest.

      We now replace subjective descriptors with actual effect size numbers to avoid overstating results. We also include additional modeling (classification based on the long-term spectrum) and discuss alternative possibilities to provide readers with points of comparison. Thus, readers can form their own opinions of the strengths of the observed effects.

      The model uses "maximally informative features" found by randomly initialising 1500 possible features and selecting the 20 most informative (in an information-theoretic sense). This is a really interesting approach to take compared to directly optimising some function to maximise performance at a task, or training a deep neural network. It is suggestive of a plausible biological approach and may serve to avoid overfitting the data. In a machine learning sense, it may be acting as a sort of regulariser to avoid overfitting and improve generalisation. The 'features' used are basically spectro-temporal patterns that are matched by sliding a cross-correlator over the signal and thresholding, which is straightforward and interpretable.
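      The feature-matching step described above can be sketched as follows, assuming that a feature is a small spectro-temporal patch tied to a fixed frequency band, that it slides only along time, and that the match score is a Pearson (normalized) cross-correlation compared against a feature-specific threshold. Whether features also slide along frequency, and how thresholds are chosen, are not specified here; the function names are hypothetical.

```python
import numpy as np


def max_normalized_xcorr(spectrogram, feature, f0):
    """Slide `feature` (n_freq x n_time) along time within rows f0:f0+n_freq
    of `spectrogram` and return the maximum Pearson correlation found."""
    spectrogram = np.asarray(spectrogram, dtype=float)
    feature = np.asarray(feature, dtype=float)
    n_f, n_t = feature.shape
    if f0 + n_f > spectrogram.shape[0]:
        raise ValueError("feature band extends beyond the spectrogram")
    if feature.std() == 0:
        raise ValueError("feature must not be constant")
    band = spectrogram[f0:f0 + n_f, :]
    fz = (feature - feature.mean()) / feature.std()
    best = -1.0
    for t in range(band.shape[1] - n_t + 1):
        patch = band[:, t:t + n_t]
        sd = patch.std()
        if sd == 0:  # a flat patch has undefined correlation; skip it
            continue
        # Mean of products of z-scores (population std) equals Pearson r.
        r = float(np.mean(fz * (patch - patch.mean()) / sd))
        best = max(best, r)
    return best


def feature_detected(spectrogram, feature, f0, threshold):
    """Binary detection: does the best match anywhere in time reach threshold?"""
    return max_normalized_xcorr(spectrogram, feature, f0) >= threshold
```

      Normalizing both patch and feature makes the score insensitive to overall level, so a single threshold per feature can be applied across exemplars.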

      This intuition is indeed accurate: the greedy search algorithm (described in the original vision paper by Ullman et al., 2002) sequentially adds to the MIF set the feature that contributes the most additional hits and the fewest additional false alarms relative to the features already in the set. The latter criterion (fewest false alarms) essentially guards against overfitting for hits alone. A second factor is the intermediate size and complexity of MIFs. When MIFs are too large, they overfit the training exemplars, and the model does not generalize well (Liu et al., 2019).
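      To make the greedy search concrete, the sketch below selects features from a binary detection matrix (rows: candidate features; columns: training exemplars) using mutual information with the class label. It uses a max-min rule in the spirit of Ullman et al. (2002) as we understand it: after the single most informative feature, each new feature is the one whose worst-case added information about the class, relative to any already-selected feature, is largest, which penalizes redundant features. This is an illustrative reading under stated assumptions, not the authors' code; the reply phrases the criterion in terms of hits and false alarms, and feature weighting and stopping rules are omitted.

```python
import numpy as np


def mutual_information(x, y):
    """Mutual information (bits) between two discrete label arrays."""
    x = np.asarray(x)
    y = np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        p_x = np.mean(x == xv)
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))
            if p_xy > 0:
                mi += p_xy * np.log2(p_xy / (p_x * np.mean(y == yv)))
    return float(mi)


def greedy_select(detections, labels, n_select):
    """Greedy max-min selection of informative binary features.

    detections: (n_candidates, n_exemplars) array of 0/1 detections
    labels: (n_exemplars,) array, 1 for the target class, 0 otherwise
    Returns the indices of the selected candidates, in selection order.
    """
    detections = np.asarray(detections, dtype=int)
    n_candidates = detections.shape[0]
    n_select = min(n_select, n_candidates)
    single = [mutual_information(d, labels) for d in detections]
    selected = [int(np.argmax(single))]  # start with the most informative feature
    while len(selected) < n_select:
        best_idx, best_gain = None, -np.inf
        for i in range(n_candidates):
            if i in selected:
                continue
            # Information feature i adds beyond selected feature j:
            # I(f_i, f_j; C) - I(f_j; C); keep the worst case over j.
            gain = min(
                mutual_information(2 * detections[i] + detections[j], labels) - single[j]
                for j in selected
            )
            if gain > best_gain:
                best_idx, best_gain = i, gain
        selected.append(best_idx)
    return selected
```

      In a toy case with two identical features and one complementary feature, the duplicate adds no information and is skipped in favor of the complementary one, which is the anti-redundancy behavior the reply describes.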

      It is surprising and impressive that the model is able to classify the manipulated stimuli at all. However, I would slightly take issue with the statement that its predictions match behaviour "to a remarkable degree". R^2 values between model and behaviour are 0.444, 0.674, 0.028, 0.011, 0.723 and 0.468. For example, in figure 5 the lower R^2 value arises because the model is not able to use segments as short as the guinea pigs can (which the authors comment on in the results and discussion). In figure 6A (speeding up and slowing down the stimuli), the model does worse than the guinea pigs for faster stimuli and better for slower stimuli, which doesn't match qualitatively (not commented on by the authors). The authors state that the poor match is "likely because of random fluctuations in behavior (e.g. motivation) across conditions that are unrelated to stimulus parameters", but it's not clear why that would be the case for this experiment and not for others, and no evidence is shown for it.

      Thank you for this feedback. There are two levels at which we addressed these comments in the revised manuscript.

      First, regarding the language: we have now replaced subjective descriptors with the statement that the model captures ~50% of the overall variance in the behavioral data. The ~50% figure is the average of the overall R2 values between the model and data (0.6 and 0.37 for the chuts vs. purrs and wheeks vs. whines tasks, respectively). We leave it to readers to interpret this number.

      Second, our original manuscript lacked clarity on exactly what aspects of the categorization behavior we were attempting to model. As recent studies have suggested, categorization behavior can be decomposed into two steps – the acquisition of the knowledge of auditory categories, and the expression of this knowledge in an operant task (Kuchibhotla et al., 2019; Moore and Kuchibhotla, 2022). Our model solely addresses how knowledge regarding categories is acquired (through the detection of maximally informative features). Other than setting a 10% error in our winner-take-all stage, we did not attempt to systematically model any other cognitive-behavioral effects such as the effect of motivation and arousal. Thus, in the revised manuscript, we have included a paragraph at the top of the Results section that defines our intent more clearly (page 5, line 117). We conclude the initial description of the behavior by stating that these factors are not intended to be captured by the model (page 6, line 171). We also edited a paragraph in the Discussion section for clarity on this point (page 26, line 629).

      In figure 11, the authors compare the results of training their model with all classes, versus training only with the classes used in the task, and show that with the latter performance is worse and matches the experiment less well. This is a very interesting point, but it could just be the case that there is insufficient training data.

      This could indeed be the case, and we acknowledge this as a potential explanation in the revised manuscript (page 22, line 537; page 27, line 653). Our original thinking was that if GPs were also learning discriminative features using only our training exemplars, they would face a similar training data constraint. But despite this constraint, the model’s performance is above d’=1 for natural calls, both training and novel calls; it is only the similarity with behavior on the manipulated stimuli that is lower than for the one vs. many model. This phenomenon warrants further investigation.

      Reviewer #2 (Public Review):

      Kar et al. aim to further elucidate the main features underlying call type categorization in guinea pigs. This paper presents a behavioral paradigm in which 8 guinea pigs (GPs) were trained in a call categorization task between pairs of call types (chuts vs purrs; wheeks vs whines). The GPs successfully learned the task and were able to generalize to new exemplars. GPs were tested with pitch-shifted stimuli and stimuli with various temporal manipulations. Complementing these data are results from a multivariate classifier model trained to perform the same task. The classifier model is trained on auditory nerve outputs (not behavioral data) and reaches an accuracy comparable to that of the GPs. The authors argue that the model's performance is similar to that of the GPs for the manipulated stimuli, suggesting that the 'mid-level features' that the model uses may be similar to those exploited by the GPs. The behavioral data are impressive: to my knowledge, there are scant previous behavioral data from GPs performing an auditory task beyond audiograms measured using aversive conditioning by Heffner et al. in 1970. [One exception, notably omitted from the manuscript, is Ojima and Horikawa 2016 (Frontiers).] Given the popularity of GPs as a model of auditory neurophysiology, these data open new avenues for investigation. This paper would be useful for neuroscientists using classifier models to simulate behavioral choice data in similar Go/No-Go experiments, especially in guinea pigs. The significance of the findings rests on the similarity (or not) of model and GP performance as a validation of the 'intermediary features' approach for categorization. At the moment, the study is underpowered for the statistical analysis the authors attempt to employ, which frequently relies on non-significant p values for its conclusions; a more sophisticated approach (a mixed effects model utilizing single trial responses) would provide a more rigorous test of the effects of the manipulations on behavior and allow a more complete assessment of the authors' conclusions.

      We thank the reviewer for their feedback and the suggestion for a more robust statistical approach. We have now replaced the repeated measures ANOVA based statistics for the behavior and model where more than 2 test conditions were presented (SNR, segment length, tempo shift, and frequency shift) with generalized linear models with a logit link function (logistic activation function). In these models, we predict the trial-by-trial behavioral or model outcome from predictors including stimulus type (Go or Nogo), parameter value (e.g., SNR value), parameter sign (e.g., positive or negative freq. shift), and animal ID as a random effect. To evaluate whether parameter value and sign had a significant contribution to the model, we compare this ‘full’ model against a null model that only has stimulus type as a predictor and animal ID as a random effect. These analyses are described in detail in the Materials and Methods section of the revised manuscript (page 36, line 930).
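      The full-versus-null comparison described above is a likelihood-ratio test between nested logistic (logit-link) models. The numpy-only sketch below fits both models by Newton-Raphson and returns the test statistic and its degrees of freedom. One important simplification: animal ID enters as fixed dummy columns rather than as a random effect, because a true mixed-effects logistic fit needs a dedicated package; the helper names and the synthetic data in the usage are hypothetical.

```python
import numpy as np


def fit_logistic(X, y, max_iter=100, tol=1e-10):
    """Maximum-likelihood logistic regression (logit link) via Newton-Raphson.
    Returns (coefficients, log-likelihood)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        score = X.T @ (y - p)                         # gradient of log-likelihood
        info = X.T @ (X * (p * (1.0 - p))[:, None])   # negative Hessian
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    p = np.clip(1.0 / (1.0 + np.exp(-(X @ beta))), 1e-12, 1 - 1e-12)
    return beta, float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


def design_matrix(stim_is_go, animal_id, param_value=None, param_sign=None):
    """Intercept + stimulus type + animal dummies (+ parameter value and sign)."""
    animal_id = np.asarray(animal_id)
    cols = [np.ones(len(animal_id)), np.asarray(stim_is_go, dtype=float)]
    # Fixed-effect stand-in for the random effect of animal ID.
    cols += [(animal_id == a).astype(float) for a in np.unique(animal_id)[1:]]
    if param_value is not None:
        cols += [np.asarray(param_value, dtype=float), np.asarray(param_sign, dtype=float)]
    return np.column_stack(cols)


def likelihood_ratio_test(y, stim_is_go, animal_id, param_value, param_sign):
    """Return (LR statistic, degrees of freedom) for the full vs null model."""
    X_null = design_matrix(stim_is_go, animal_id)
    X_full = design_matrix(stim_is_go, animal_id, param_value, param_sign)
    _, ll_null = fit_logistic(X_null, y)
    _, ll_full = fit_logistic(X_full, y)
    return 2.0 * (ll_full - ll_null), X_full.shape[1] - X_null.shape[1]
```

      Under the null hypothesis the statistic is approximately chi-square distributed with df degrees of freedom, so a p-value can be obtained with, for example, `scipy.stats.chi2.sf(stat, df)`.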

      These analyses reveal significant effects of segment length changes, and weak effects of tempo changes on behavior (as expected by the reviewer). Both the behavior and model showed similar statistical significance (except tempo shift for wheeks vs. whines) for whether performance was significantly affected by a given parameter.

      The behavioral data presented here are descriptive. The central conceptual conclusions of the manuscript are derived from the comparison between the model and behavioral data, and these comparisons do not rely on statistical p-values. We realized that our description of how we compared model and behavioral data was not clear in the original manuscript. To compare behavioral data with the model, we fit a line to the d’ values obtained from the model plotted against the d’ values obtained from behavior, and computed the R2 value. We used the mean absolute error (MAE) to quantify the absolute deviation between model and behavior d’ values. Thus, high R2 values signify a close correspondence between the model and behavior regardless of the statistical significance of individual data points. We now clarify this on page 12, line 289. We derive R2 values for individual stimulus manipulations, as well as an overall R2 by pooling across all manipulations (presented in Fig. 11). This is now clarified on page 21, line 494.
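      The comparison described above can be sketched in a few lines. Here d' uses the standard signal-detection formula d' = z(hit rate) − z(false-alarm rate); clipping extreme rates to avoid infinite z-scores is a common correction that we assume for illustration, since the correction used in the manuscript is not stated here. The line fit and R2 follow the description (model d' against behavioral d'), and MAE is the mean absolute model-behavior difference. Function names are hypothetical.

```python
from statistics import NormalDist

import numpy as np


def d_prime(hit_rate, fa_rate, eps=0.01):
    """d' = z(H) - z(FA), with rates clipped to [eps, 1 - eps] (assumed correction)."""
    z = NormalDist().inv_cdf
    h = min(max(hit_rate, eps), 1.0 - eps)
    f = min(max(fa_rate, eps), 1.0 - eps)
    return z(h) - z(f)


def compare_model_to_behavior(d_behavior, d_model):
    """Fit d_model = a * d_behavior + b; return (R^2 of the fit, mean absolute error)."""
    x = np.asarray(d_behavior, dtype=float)
    y = np.asarray(d_model, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
    mae = float(np.mean(np.abs(y - x)))  # absolute model-behavior deviation
    return float(r2), mae
```

      The two numbers are complementary: a model whose d' values are all shifted by a constant can reach R2 = 1 while still having a non-zero MAE, which is why both are reported.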

      Reviewer #3 (Public Review):

      The authors designed a behavioral experiment based on a Go/No-Go paradigm to train guinea pigs on call categorization. They used two different pairs of call categories: chuts vs. purrs and wheeks vs. whines. During training, the animals changed their behavioral strategies. Initially, they did not associate the auditory stimuli with rewards, and hence overweighted the No-Go behavior (low hit and false alarm rates). Subsequently, they learned the association between auditory stimuli and reward, leading to overweighting of the Go behavior (high hit and false alarm rates). Finally, they learned to discriminate between the two call categories and showed the corresponding behaviors, i.e. suppressed the Go behavior for No-Go stimuli (improved discrimination performance due to stable hit rates but lower false alarm rates).

      In order to derive a mechanistic explanation of the observed behaviors, the authors implemented a computational feature-based model, with which they mirrored all animal experiments, and subsequently compared the resulting performances.

      Strengths:

      In order to construct their model, the authors identified several different sets of so-called MIFs (most informative features) for each call category, that were best suited to accomplish the categorization task. Overall, model performance was in general agreement with behavioral performance for both the chuts vs. purrs and wheeks vs. whines tasks, in a wide range of different scenarios.

      Different instances of their model, i.e. models using different sets of MIFs, performed equally well. In addition, the authors could show that guinea pigs and models can rapidly generalize to categorize new call exemplars.

      The authors also tested the categorization performance of guinea pigs and models in a more realistic scenario, i.e. communication in noisy environments. They found that both guinea pigs and the model exhibit similar categorization-in-noise thresholds.

      Additionally, the authors investigated the effect of temporal stretching/compression of calls on categorization performance. Remarkably, this had virtually no negative effect on either models or animals, and both performed equally well even for time reversal. Finally, the authors tested the effect of pitch change on categorization performance and found very similar effects in guinea pigs and models: discrimination performance depends crucially on pitch change, i.e. it systematically decreases with the percentage of change.

      Weaknesses:

      While their computational model can explain certain aspects of call categorization after training, it cannot explain the time course of different behavioral strategies shown by the guinea pigs during learning/training.

      Thank you for bringing this up; in hindsight, the original manuscript lacked clarity on exactly what aspects of the behavior we were trying to model. As recent studies have suggested, categorization behavior can be decomposed into two steps: the acquisition of the knowledge of auditory categories, and the expression of this knowledge in an operant task (Kuchibhotla et al., 2019; Moore and Kuchibhotla, 2022). Our model solely addresses how knowledge regarding categories is acquired (through the detection of maximally informative features). Other than setting a 10% error in our winner-take-all stage, we did not attempt to systematically model any other cognitive-behavioral effects such as the effect of motivation and arousal, or behavioral strategies. Thus, in the revised manuscript, we have included a paragraph at the top of the Results section that defines our intent more clearly (page 5, line 117). We conclude the initial description of the behavior by stating that these factors are not intended to be captured by the model (page 6, line 171). We also edited a paragraph in the Discussion section for clarity on this point (page 26, line 629).

      Furthermore, the model cannot account for the fact that short-duration segments of calls (50ms) already carry sufficient information for call categorization in the guinea pig experiment. Model performance, however, only plateaued after a 200 ms duration, which might be due to the fact that the MIFs were on average about 110 ms long.

      The segment-length data indeed demonstrate a deviation between the data and the model. As we had acknowledged in the original manuscript, this observation suggests further constraints (perhaps on feature length and/or bandwidth) that need to be imposed on the model to better match GP behavior. We originally did not perform this analysis because we wanted to demonstrate that a model with minimal assumptions and parameter tuning could capture aspects of GP behavior.

      We have now repeated the modeling by constraining the features to a duration of 75 ms (the lowest duration for which GPs show above-threshold performance). We found that the constrained MIF model better matched GP behavior on the segment-length task (R2 of 0.62 and 0.58 for the chuts vs. purrs and wheeks vs. whines tasks; with the model crossing d’=1 for 75 ms segments for most tested cases). The constrained MIF model maintained similarity to behavior for the other manipulations as well, and yielded higher overall R2 values (0.66 for chuts vs. purrs, 0.51 for wheeks vs. whines), thereby explaining an additional 10% of variance in GP behavior.

      In the revised manuscript, we included these results (page 28, line 699), and present results from the new analyses as Figure 11 – Figure Supplement 2.
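      The ~50% figure given in the reply to Reviewer #1 is, as stated there, the average of the two per-task overall R2 values; applying the same averaging to the constrained (75 ms) model reproduces the additional 10% quoted above, using only the numbers in these replies:

```latex
\bar{R}^2_{\text{original}} = \frac{0.60 + 0.37}{2} = 0.485 \approx 0.5, \qquad
\bar{R}^2_{75\,\text{ms}} = \frac{0.66 + 0.51}{2} = 0.585, \qquad
\bar{R}^2_{75\,\text{ms}} - \bar{R}^2_{\text{original}} = 0.10
```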

      In the temporal stretching/compressing experiment, it remains unclear, if the corresponding MIF kernels used by the models were just stretched/compressed in a temporal direction to compensate for the changed auditory input. If so, the modelling results are trivial. Furthermore, in this case, the model provides no mechanistic explanation of the underlying neural processes. Similarly, in the pitch change experiment, if MIF kernels have been stretched/compressed in the pitch direction, the same drawback applies.

      We did not alter the MIFs in any way for the tests; the MIFs were derived purely by training the model on natural calls. In learning to generalize over the variability in natural calls, the model also acquired the ability to generalize over some manipulated stimuli. The fact that the model tracks GP behavior is a key observation supporting our argument that GPs also learn MIF-like features to accomplish call categorization.

      We had mentioned in a few places that the model was only trained on natural calls. To add clarity, we have now included sentences in the time-compression and frequency-shifting results affirming that we did not manipulate the MIFs to match test stimuli. We also include a couple of sentences in the first paragraph of the Discussion section stating the above argument (page 26, line 615).

    1. Author Response

      Reviewer #1 (Public Review):

      Causality is important and desired but usually difficult to establish. In this work, Park et al. conducted a comprehensive phenome-wide, two-sample Mendelian randomization analysis to infer the causal effects of plasma triglyceride (TG) levels on 2,600 disease traits. They identified causal associations between plasma TG levels and 19 disease traits, related to both atherosclerotic cardiovascular diseases (ASCVD) and non-ASCVD diseases. They used biobank-scale data in both discovery analysis and replication analysis.

      The conclusions of this work are mostly supported by the data and analysis, but some aspects need to be clarified and extended.

      (1) The datasets used in this study may not be very consistent. For example, UKB participants were aged 40-69 years at recruitment. In addition, UKB is United Kingdom-based and FinnGen is Finland-based, so the definitions of outcomes may not be identical. The authors should discuss the differences between the datasets and their potential effects.

      The reviewer is correct about the differences between UKB and FinnGen and that the definition of clinical outcomes between the two datasets may not be identical due to differences in healthcare systems and population demographics. We now mention this in the discussion section as a potential limitation.

      Manuscript changes:

      Lines 520-539: “Third, UKB and FinnGen have innate differences in participant demographics and medical coding systems, due in part to the former being based in the United Kingdom and the latter in Finland. As such, potential misclassification of participants in case-control assignment is a liability to this study. We exercised caution in mapping UKB traits to FinnGen traits, but we were unable to reliably map all “categorical” traits from UKB to corresponding traits in FinnGen, testing for replication only 221 of the 598 associations that were nominally significant in the primary analysis. We note, however, that despite geographical differences, both datasets largely involve White European participants of older age, with the mean age in UKB and FinnGen being 56.5 and 59.8, respectively.”

      (2) The discovery and replication analyses are not completely independent because data from UKB were used in both. In discovery, the UKB data were used for associations with outcomes, whereas in replication they were used for associations with the exposure. The authors may want to explain whether this could cause problems.

      The reviewer is correct that UKB data were used in both the discovery and replication analyses with the caveat that the discovery analysis used UKB for outcomes while using GLGC for exposures, whereas the replication analysis used UKB for exposures while using FinnGen for outcomes. We believed this would be a creative use of three different datasets and a strength of the study; however, we agree that examining the implications of this study design is needed to acknowledge potential biases. We now expand on this in the discussion section as a potential limitation.

      Manuscript changes:

      Lines 539-545: “Fourth, discovery and replication analyses were not completely independent, since UKB data were used in both analyses. This could potentially exacerbate demographic and measurement biases inherent to UKB; however, we show that taking a traditional replication approach using GLGC instead of UKB for selecting exposure instruments in replication returns comparable Tier 1 results (Supplementary File 5), while losing statistical power to highlight many of the Tier 2 and 3 results.”

      (3) As stated in the manuscript, there are three assumptions for MR analysis. The validity of the results depends on the validity of the assumptions. The last two assumptions are usually difficult to validate. To the authors' credit, they conducted sensitivity analyses addressing horizontal pleiotropy, which is related to assumption 3. It would be helpful if the authors can discuss those assumptions explicitly.

      We now explicitly state the assumptions of Mendelian randomization in the introduction section and discuss the validity of these assumptions in the discussion section.

      Manuscript changes:

      Lines 501-514: “The study has several limitations. First, MR is a powerful but potentially fallible method that relies on several key assumptions, namely that genetic instruments (i) are associated with the exposure (the relevance assumption); (ii) have no common cause with the outcome (the independence assumption); and (iii) affect the outcome solely through the exposure (the exclusion restriction assumption) (Hartwig et al., 2016). In MR, (i) is relatively straightforward to test, while (ii) and (iii) are difficult to establish unequivocally. As a prominent example, horizontal or type I pleiotropy has been shown to be common in genetic variation and can bias MR estimates (Verbanck et al., 2018; Jordan et al., 2019). This occurs when a genetic instrument influences the outcome through pathways other than the exposure. To detect and correct for this as far as possible, we used several MR methods as sensitivity analyses that each aim to detect or adjust for horizontal pleiotropy, including MR-PRESSO, MR-Egger, and the weighted median method. No method is universally accepted as perfectly robust to horizontal pleiotropy, so we take the best current approach of using multiple methods and examining the consistency of results.”
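
      Two of the sensitivity analyses named above can be illustrated from SNP-level summary statistics. The sketch below is a minimal, hypothetical implementation of the MR-Egger point estimates (intercept and slope) and the interpolated weighted median estimate. It assumes independent SNPs with effects on the exposure (bx), effects on the outcome (by), and standard errors of the outcome effects (se_y); it omits standard errors, bootstrapping, and MR-PRESSO, and it is not the software used in the study.

```python
from typing import Sequence, Tuple


def mr_egger(bx: Sequence[float], by: Sequence[float],
             se_y: Sequence[float]) -> Tuple[float, float]:
    """Weighted regression of SNP-outcome on SNP-exposure effects with an intercept.

    Returns (intercept, slope): a non-zero intercept suggests directional
    pleiotropy; the slope is the pleiotropy-adjusted causal estimate.
    """
    # Orient each SNP so that its effect on the exposure is positive.
    x = [abs(b) for b in bx]
    y = [o if b >= 0 else -o for b, o in zip(bx, by)]
    w = [1.0 / s ** 2 for s in se_y]  # inverse-variance weights
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def weighted_median(bx: Sequence[float], by: Sequence[float],
                    se_y: Sequence[float]) -> float:
    """Weighted median of per-SNP ratio estimates with linear interpolation."""
    ratios = [o / b for b, o in zip(bx, by)]
    # First-order weights: 1 / se(ratio)^2, with se(ratio) ~ se_y / |bx|.
    weights = [(b / s) ** 2 for b, s in zip(bx, se_y)]
    order = sorted(range(len(ratios)), key=ratios.__getitem__)
    r = [ratios[i] for i in order]
    w = [weights[i] for i in order]
    total = sum(w)
    cum, s = 0.0, []
    for wi in w:
        cum += wi
        s.append((cum - 0.5 * wi) / total)  # standardized cumulative weight
    below = [j for j, sj in enumerate(s) if sj < 0.5]
    if not below:
        return r[0]
    j = below[-1]
    if j == len(r) - 1:
        return r[-1]
    # Interpolate between the two ratio estimates straddling the 50% point.
    return r[j] + (r[j + 1] - r[j]) * (0.5 - s[j]) / (s[j + 1] - s[j])
```

      The weighted median remains consistent when up to half of the total weight comes from invalid instruments, whereas MR-Egger instead tolerates pleiotropy that is independent of instrument strength; agreement between the two is what the consistency check above relies on.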

      Reviewer #2 (Public Review):

      This work conducted a Mendelian randomization analysis between TG and a large number of disease traits in biobanks. They leverage the publicly available summary statistics from the European samples of the UK Biobank and FinnGen. A solid but routine summary-statistics-based MR study is conducted. Several significant causal associations from TG to phenotypes are called by setting a p-value cutoff with Bonferroni correction. Sensitivity statistical analyses are conducted, which generate largely consistent results. The research problem is important and relevant for public health as well as drug development. Overall this is a solid execution of current methods over an appropriate data source and yields a convincing result. The interpretation of the results in the discussion is also well balanced.

      While the paper does have strengths in principle, a few technical weaknesses are observed.

      They used UK Biobank as the discovery cohort and FinnGen as the replication cohort, but the two cohorts are used rather symmetrically. In particular, Tier 3 (NB) seems to be an attempt to reuse the replication cohort as the discovery cohort. I wonder whether that creates an additional multiple-testing burden, as a greater number of hypotheses are considered.

      We thank the reviewer for this thought-provoking comment. As the reviewer is aware, MR studies have generally not accounted for multiple testing in the past, since they have usually focused on single exposures and/or single diseases. Ours is one of the relatively few MR studies taking a phenome-wide, high-throughput approach, so determining the optimal threshold for balancing true-positive vs. false-positive discovery is an important aspect of the study warranting discussion.

      We agree that Tier 3 results carry the least stringent level of statistical evidence (i.e., nominally significant in discovery using UK Biobank and Bonferroni-significant in replication using FinnGen), and that these results should be interpreted with caution. As a phenome-wide study, a significant aim of this work was to generate hypotheses, and so, we decided to present our results using the three tiers of statistical evidence to highlight as many promising associations as possible for further investigation. Nevertheless, we now express extra caution in the results and discussion sections regarding Tier 2 and 3 results, and we also note as a limitation that these results especially require external replication.

      Manuscript changes:

      Lines 438-444: “Regarding non-ASCVDs, we present suggestive genetic evidence of potentially causal associations between plasma TG levels and uterine leiomyomas (uterine fibroids), diverticular disease of intestine, paroxysmal tachycardia, hemorrhage from respiratory passages (hemoptysis), and calculus of kidney and ureter (kidney stones). Due to the weaker statistical evidence supporting these associations, special caution is encouraged when interpreting these results to infer causality, and further replication and validation studies are essential for all Tier 2 and Tier 3 results.”

      The replication p-value cutoff is somewhat statistically lenient. In a typical discovery-replication setting, the two stages are conducted sequentially, and replication should be Bonferroni-adjusted for the number of significant discovery signals that are tested in replication. For example, in Tier 2 the cutoff should be 0.05/39. This may make the association with leiomyoma of the uterus slightly non-significant, though. A similar cutoff should be applied to Tier 3 as well.

      We thank the Reviewer for highlighting this important point. We agree that in a standard two-stage discovery and replication study design, the Bonferroni adjustment should be based on the number of significant signals from discovery that is tested in the replication. We had initially considered this approach but chose the current tiered approach based on a number of factors:

      First, we had initially considered performing a standard meta-analysis between the UK Biobank and FinnGen datasets and applying a Bonferroni adjustment for the total number of tests. However, it was not possible to reliably map the phenotypes between UK Biobank and FinnGen on a large scale due to different classification schemes.

      Second, we noticed that if we focused only on the sequential two-stage design, we would ignore strong causal relationships observed in FinnGen that passed Bonferroni adjustment but were only nominally associated in UK Biobank. Although these findings are not as strong as Tier 1 findings, we believe they warrant some consideration. This is particularly relevant since differences in the strength of the causal relationship could be attributed to the different populations studied, sample size, different health systems used to measure disease outcomes, and differences in statistical power in the MR tests between the two stages (e.g., number of IVs), amongst others.

      Third, we wanted to point out that Bonferroni adjustment for the total number of phenotypes tested is very conservative, because the multiple EHR phenotypes have varying degrees of redundancy and correlation. We believe the appropriate P-value cutoff lies somewhere between the Bonferroni adjustment for the total number of phenotypes and the nominal P-value (no adjustment for the number of phenotypes).

      We devised this somewhat unconventional tiered P-value approach to address the points mentioned above. We have now included text to further explain our approach and to note that Tier 2 and Tier 3 results require further replication and validation.

      Manuscript changes:

      Lines 266-283: “This presentation is somewhat unconventional and partly arises from the study’s use of three different datasets for instrument selection. In a traditional two-stage discovery and replication design, Bonferroni adjustment is based on the number of significant signals from discovery that are tested in replication. Here, we used three tiers of statistical evidence to present results because a standard meta-analysis between UKB and FinnGen was not possible, given that not all phenotypes could be reliably mapped between the two datasets. Additionally, in a sequential two-stage design, results that were Bonferroni-significant in the FinnGen replication analysis would have been ignored if they were only nominally associated in UKB. The three tiers are defined below:”
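
      For comparison with the tiered scheme, the reviewer's sequential rule can be written out explicitly: carry discovery hits forward, then Bonferroni-correct the replication stage only for the traits actually tested there. This is a minimal sketch of that alternative; the function, trait names, and p-values are hypothetical, and only the denominators 2,600, 39, and 221 come from this exchange.

```python
from typing import Dict, List, Optional, Tuple


def replication_cutoff(discovery_p: Dict[str, float],
                       replication_p: Dict[str, float],
                       discovery_alpha: float,
                       alpha: float = 0.05) -> Tuple[Optional[float], List[str]]:
    """Sequential two-stage design: Bonferroni-correct replication only over the
    discovery signals that were carried forward and actually tested."""
    carried = [t for t, p in discovery_p.items() if p < discovery_alpha]
    # Only traits that could be mapped to the replication dataset are tested.
    tested = [t for t in carried if t in replication_p]
    if not tested:
        return None, []
    cutoff = alpha / len(tested)
    return cutoff, sorted(t for t in tested if replication_p[t] < cutoff)


# Thresholds quoted in this exchange, for comparison.
print(f"{0.05 / 2600:.2e}")  # Bonferroni over all 2,600 traits
print(f"{0.05 / 39:.2e}")    # reviewer's suggested Tier 2 replication cutoff
print(f"{0.05 / 221:.2e}")   # 221 nominal UKB hits tested in FinnGen

# Hypothetical toy input: trait_d cannot be mapped, so it is not tested.
cutoff, hits = replication_cutoff(
    {"trait_a": 0.001, "trait_b": 0.02, "trait_c": 0.3, "trait_d": 0.04},
    {"trait_a": 0.001, "trait_b": 0.03, "trait_c": 0.0001},
    discovery_alpha=0.05,
)
```

      In the toy input, trait_c would be Bonferroni-significant in replication but is never tested because it was not a discovery hit, which is the kind of result the authors' Tier 3 was designed not to discard.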

      Lines 441-444: “Due to the weaker statistical evidence supporting these associations, special caution is encouraged when interpreting these results to infer causality, and further replication and validation studies are essential for all Tier 2 and Tier 3 results.”

      Lines 498-500: “However, we reiterate that this Tier 3 association was only nominally significant in discovery, while Bonferroni-significant in replication, and future studies are needed to validate the statistical evidence.”

      Lines 565-567: “However, caution is still warranted in inferring causality, as MR depends on specific assumptions and the validity of those assumptions must be carefully assessed. Thus, diverse study designs remain necessary to triangulate evidence on the causal effects of plasma TG levels.”

      The causal effect of TG on leiomyoma of the uterus is weak, as indicated by both its sub-threshold significance in replication and the non-significant MR-PRESSO result. Similarly, given the weaker statistical rigor, I would recommend more caution when interpreting Tier 2 and Tier 3 results.

      We agree with the Reviewer. We have now emphasized more caution in interpreting Tier 2 and Tier 3 results. We have also explicitly restated the weaker statistical evidence underlying these results and noted the need for future validation. Please see our detailed response to the comment above.

      Manuscript changes:

      Lines 498-500: “However, we reiterate that this Tier 3 association was only nominally significant in discovery, while Bonferroni-significant in replication, and future studies are needed to validate the statistical evidence.”

      Another methodological choice that might need justification is the use of UKB TG GWAS loci (1,248 SNPs) as the instruments for FinnGen. This may create some subtle interference with the use of UKB for outcomes in the discovery analysis. It may be minor, but some justification, or at least some discussion of potential limitations, should be provided. What about the alternative of using GLGC for instruments in replication?

      We agree with the reviewer that the use of UKB TG GWAS loci (1,248 SNPs) as instruments for FinnGen outcomes needs additional justification. We now detail this decision in the text as copied below.

      Additionally, we now present new data comparing MR results on FinnGen outcomes when selecting TG instruments from UKB GWAS versus GLGC GWAS. Statistical significance after Bonferroni correction was set to 0.05/221, where 221 was the number of disease traits nominally significant in UKB that were tested in FinnGen. The results were fairly consistent. All Tier 1 results remained Bonferroni-significant, whether using TG SNPs from UKB or GLGC. Although statistical significance decreased for the remaining diseases of interest, the direction of the estimated effects remained consistent, and three disease traits remained significant (hypertension, aortic aneurysm, and alcoholic liver disease). These results suggest that instrumenting TG using 1,248 SNPs from UKB might carry more power than using the 141 SNPs from GLGC, which allowed the detection of associations in our initial replication analysis using UKB for exposures and FinnGen for outcomes. We now include this analysis in the text and include the figure below, as well as its underlying data, as supplemental material (Supplementary File 5).

      Manuscript changes:

      Lines 229-236: “We selected UKB TG GWAS loci as the instruments for replication on FinnGen outcomes, rather than GLGC TG GWAS loci, to diversify the source of TG instruments and mitigate potential biases associated with one TG GWAS. Moreover, UKB GWAS included a larger study population than GLGC GWAS, providing a greater number of genetic instruments that can together explain more of the variance in plasma TG levels, and thus, greater statistical power and precision. Nevertheless, we also performed the replication analyses using TG instruments from GLGC and included these results as supplemental data (Supplementary File 5).”
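
      The argument that 1,248 UKB instruments give more power than 141 GLGC instruments rests on the proportion of TG variance they jointly explain. A common approximation, assuming independent (LD-clumped) SNPs with additive effects expressed in trait standard-deviation units and allele frequencies in Hardy-Weinberg equilibrium, is sketched below together with the overall F-statistic; all inputs in the usage lines are hypothetical, not values from either GWAS.

```python
from typing import Sequence


def variance_explained(betas: Sequence[float], eafs: Sequence[float],
                       trait_var: float = 1.0) -> float:
    """Approximate R^2 explained by independent additive SNPs:
    sum over SNPs of 2 f (1 - f) beta^2, divided by the trait variance."""
    return sum(2 * f * (1 - f) * b ** 2 for b, f in zip(betas, eafs)) / trait_var


def f_statistic(r2: float, n: int, k: int) -> float:
    """Overall F-statistic for k instruments explaining r2 in a sample of size n."""
    return r2 * (n - k - 1) / ((1 - r2) * k)


# Hypothetical: 10 stronger SNPs vs 100 weaker SNPs, all at allele frequency 0.3.
few = variance_explained([0.08] * 10, [0.3] * 10)
many = variance_explained([0.04] * 100, [0.3] * 100)
print(f"R2: {few:.4f} vs {many:.4f}")
print(f"F: {f_statistic(few, 100_000, 10):.0f} vs {f_statistic(many, 100_000, 100):.0f}")
```

      In this toy comparison the larger set explains more variance even though each SNP is weaker and the overall F per instrument falls, which illustrates why a larger instrument set can add power without every instrument being strong.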

      For disease outcomes (line 188), the UKB European sample size is ~400,000 rather than ~500,000. Can the authors clarify the sample size they used?

      We thank the reviewer for catching this detail. We have now clarified the sample size of UKB European participants in the Methods section, and we also included the exact sample size of each disease trait GWAS (cases and controls) in Supplementary Figure 1.

      Manuscript changes:

      Lines 194-201: “Pan-UKB had performed 16,131 GWASs on 7,221 phenotypes in 420,531 UKB participants of European ancestry using genetic and phenotypic data (PanUKBTeam, 2020). These 7,221 phenotypes had been categorized as “biomarker”, “continuous”, “categorical”, “ICD-10 code”, “phecode”, or “prescription” (PanUKBTeam, 2020). We filtered for outcomes to retain categorical, ICD-10, and phecode types; non-null heritability in European ancestry as estimated by Pan-UKB; and relevance to disease, excluding medications. This yielded 2,600 traits for primary analysis. The exact sample size of each GWAS for each of these traits is provided in Supplementary File 1.”

      It would be reassuring to the reader if the TG measurements were treatment-naïve. GLGC accounted for treatment (at least for LDL; the authors should check whether this also holds for TG, and if not, there is presumably a reason). UKB may not have done so.

      We now provide information about whether the lipid measurements were measured in a treatment-naïve manner in the Methods for GLGC and UKB. We also address this point in the discussion section as a potential limitation.

      Manuscript changes:

      Lines 179-180: “We note that the GLGC GWAS had excluded individuals known to be on lipid-lowering medications.”

      Lines 187-188: “We note that the Pan-UKB GWAS study did not exclude participants based on their use of lipid-lowering medications.”

      Lines 545-546: “Fifth, the GLGC GWAS used to select instruments for plasma TG levels in discovery had accounted for lipid-lowering treatment, while the UKB GWAS used in replication had not.”

      "Phenome-wide MR is a high-throughput extension of MR that, under specific assumptions, estimates the causal effects of an exposure on multiple outcomes simultaneously." - I guess it is more informative to mention the specific assumptions, at least briefly, in the introduction so it is easier for the reader to interpret the results.

      We agree with the reviewer that it would be informative to explicitly state the assumptions of Mendelian randomization. We now explicitly state these assumptions in the introduction.

      Manuscript changes:

      Lines 123-129: “Phenome-wide MR is a high-throughput extension of MR that estimates the causal effects of an exposure on multiple outcomes simultaneously. As in conventional MR, this method uses genetic variants as instrumental variables (IV) to proxy modifiable exposures (Davey Smith & Ebrahim, 2003), and importantly, it relies on three critical assumptions: (1) The genetic variant is directly associated with the exposure; (2) The genetic variant is unrelated to confounders between the exposure and outcome; and (3) The genetic variant has no effect on the outcome other than through the exposure (Davey Smith & Ebrahim, 2003).”

      Reviewer #3 (Public Review):

      Park and Bafna et al. applied a genetics-based epidemiological approach, Mendelian randomization (MR) analysis, to evaluate the potential causal roles of triglycerides across 2,600 disease traits (i.e., the phenome). In a typical two-sample MR framework, they used existing genome-wide association study (GWAS) summary statistics from two separate studies: the Global Lipids Genetics Consortium (GLGC) and UK Biobank in the discovery analysis, and UK Biobank and FinnGen in the replication analysis. This replication design is a great strength of the study, enhancing the robustness and reproducibility of the results. For the candidate causal associations, the authors further performed multiple sensitivity analyses to evaluate the robustness of the results to possible violations of MR assumptions. To disentangle the independent effects of triglycerides from those of other lipid fractions (i.e., LDL-cholesterol and HDL-cholesterol), the authors performed multivariable MR analysis. Finally, possible causal associations were reported in three tiers, based on statistical significance in the two-stage analysis. The results support the causal effects of triglycerides in increasing the risk of atherosclerotic cardiovascular disease. They also reveal novel conditions, which are either new treatable conditions (e.g., leiomyoma, hypertension, calculus of kidney and ureter) for repurposing triglyceride-lowering drugs, or possible side effects (e.g., alcoholic liver disease) that warrant special attention during triglyceride-lowering treatment.

      The analysis approaches in the paper are standard and solid. The discovery-replication study design is a great strength. Correction for multiple testing was implemented in a conservative way. The sensitivity analyses and MVMR strengthen the robustness of the results. The manuscript is very clearly written and pleasant to read. The limitations were well-presented. The conclusions and interpretations are mostly supported by the data, with one major concern as explained below. But overall, in addition to the specific findings, this study could be an exemplar study for the use of phenome-wide MR in identifying treatable conditions and side effects for most existing drugs.

      1) My major concern is about reverse causation. For example, having atherosclerotic cardiovascular disease increases circulating triglycerides. Reverse causation can induce false positives in MR analysis. With the existing data in this study, the authors can perform a reverse MR to evaluate the effect of the 19 disease traits on triglycerides. Ruling out reverse causation is important to ensure that the current findings are not false positives.

      We agree with the reviewer that performing reverse MR would be important to rule out reverse causation. We now present new results using reverse MR, selecting instruments for disease from UKB and instruments for TG from GLGC (i.e., reversing the discovery analysis). We provide an interpretation of these new results in the discussion section and present the underlying data, including the number of genetic variants used, in Supplementary File 6. Please note we could only perform reverse MR on 9 of the 19 diseases of interest, due to insufficient genetic data in GLGC to extract the specific exposure instruments. As expected, we observed significant associations (orange) between “disorders of lipoprotein metabolism” and “hyperlipidemia” with plasma TG levels; however, all other estimates were non-significant, suggesting unidirectional associations for the remaining seven disease traits. We now include the figure below and its underlying data as supplements (Supplementary File 6).

      Manuscript changes:

      Lines 258-261: “Finally, we performed bidirectional or reverse MR on significant results to examine the potential presence of reverse causation. We selected instruments for each disease as described above from Pan-UKB and instruments for plasma TG levels from GLGC, essentially reversing the discovery stage design using a fixed-effect IVW method.”

      Lines 368-373: “Finally, we performed reverse MR to estimate the effects of significant disease traits on plasma TG levels, selecting instruments from UKB and GLGC, respectively. Genetic data were sufficiently available to perform this analysis for 9 of the 19 diseases of interest. These results are presented in Supplementary File 6. Expectedly, “disorders of lipoprotein metabolism” and “hyperlipidemia” had positive effects on plasma TG levels; however, no other examined disease trait showed results suggesting reverse causation.”
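
      Lines 258-261 describe reverse MR with a fixed-effect IVW estimator. The sketch below shows that estimator from summary statistics: in the forward direction the instruments are TG SNPs, and in the reverse direction the same function is simply fed SNPs selected for a disease trait, with their disease effects as beta_exposure and their TG effects as beta_outcome. It is a minimal sketch assuming independent SNPs and no heterogeneity correction; the class, function, and numbers are hypothetical, not the study's code or data.

```python
import math
from dataclasses import dataclass
from statistics import NormalDist
from typing import Sequence, Tuple


@dataclass
class SnpAssoc:
    beta_exposure: float  # SNP effect on the exposure used in this direction
    beta_outcome: float   # SNP effect on the outcome used in this direction
    se_outcome: float     # standard error of beta_outcome


def ivw_fixed(snps: Sequence[SnpAssoc]) -> Tuple[float, float, float]:
    """Fixed-effect inverse-variance-weighted estimate, its SE, and a two-sided p-value."""
    num = sum(s.beta_exposure * s.beta_outcome / s.se_outcome ** 2 for s in snps)
    den = sum(s.beta_exposure ** 2 / s.se_outcome ** 2 for s in snps)
    estimate = num / den
    se = math.sqrt(1.0 / den)
    # Use the lower tail directly to keep precision for very small p-values.
    p = 2 * NormalDist().cdf(-abs(estimate / se))
    return estimate, se, p


# Hypothetical reverse-direction input: instruments chosen for a disease trait,
# with their log-odds on the disease and their effects on plasma TG.
reverse_snps = [SnpAssoc(0.10, 0.002, 0.004), SnpAssoc(0.08, -0.001, 0.004),
                SnpAssoc(0.12, 0.003, 0.005)]
est, se, p = ivw_fixed(reverse_snps)
print(f"reverse IVW: {est:.3f} (SE {se:.3f}), p = {p:.2g}")
```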

    1. Author Response:

      Reviewer #2:

      Non-canonical pathways for regulating protein synthesis serve important roles in controlling gene expression in critical developmental pathways. Homeobox (Hox) genes encode many mRNAs that are regulated at the level of translation. A general feature proposed for many of these mRNAs is that they are regulated by Internal Ribosome Entry Sites (IRESs) and possess sequences in the 5'-untranslated region (5'-UTR) of the mRNA that prevent canonical cap-dependent translation, termed "translation inhibitory elements" or TIEs. However, the mechanisms by which these Hox mRNAs are regulated remain unclear. Here, the authors focus on two Hox mRNAs, Hox a3 and Hox a11, and find that they use entirely different means to achieve the same end of repressing cap-dependent translation. Hox a3 uses the non-canonical translation initiation factor eIF2D and an upstream open reading frame (uORF), whereas a11 uses a "start-stop" uORF followed by a thermodynamically stable stem-loop to inhibit translation. Overall, the experiments support the major conclusions drawn by the authors and nail down mechanisms that have remained unresolved since the Hox mRNAs were first found to be regulated at the level of translation. These results will be of wide interest to the translation and developmental biology fields.

      Some issues the authors should consider:

      1) The mapping of the TIE boundaries is in general well supported by the luciferase reporter experiments. However, there seems to be a disconnect between the luciferase values in Fig. 1B and the western blots in Supplementary Fig. 1D. For example, in the a3 case, the 106 and 113 bands don't seem to correspond to levels consistent with the luciferase activity. For a11, the 153 band is not consistent with the luciferase activity. Also, the gels at the bottom are confusing. Should 74 in the left gel be 77? It would help to have a clearer explanation in the figure legend.

      The reviewer is right; Supplementary Figure 1D is misleading. We have clarified the data with a new Supplementary Figure 1D. The gels presented in this figure are not western blots; they are SDS-PAGE analyses of the translated product (i.e., Renilla luciferase protein) synthesized in the presence of [35S]-methionine. Since the function of TIE elements was measured in comparison with reporters that do not contain any TIE element, we loaded a reference (lanes w/o TIE) on each gel for quantification purposes. Since the exposure time of distinct gels varied, intensities should not be compared between gels. We added the quantification of the gel intensity relative to the reference construct (w/o TIE). We agree with the reviewer that the two gels at the bottom are not informative, and we removed them from the new Supplementary Figure 1D.
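
      The within-gel normalization described above can be sketched as follows: each band is expressed relative to the no-TIE reference lane on the same gel, so differences in exposure time between gels cancel out. This is a minimal sketch assuming background-subtracted band intensities within the linear range of detection; the function, lane names, and values are hypothetical.

```python
from typing import Dict


def relative_translation(gel: Dict[str, float],
                         reference: str = "w/o TIE") -> Dict[str, float]:
    """Express each lane's band intensity as % of the no-TIE reference on the same gel."""
    ref = gel[reference]
    return {lane: 100.0 * value / ref for lane, value in gel.items() if lane != reference}


# Two hypothetical gels exposed for different times: raw intensities differ
# roughly two-fold, but each lane is compared only with its own gel's reference.
gel_1 = {"w/o TIE": 2000.0, "construct A": 500.0}
gel_2 = {"w/o TIE": 4100.0, "construct B": 1025.0}
print(relative_translation(gel_1), relative_translation(gel_2))
```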

      2) The results in the various sucrose gradients are not entirely convincing as presented. In all these cases, the experiment would benefit from the use of high-salt conditions (See Lodish and Rose, 1977, JBC 252, 1181-ff) in the gradient to remove background 80S not engaged with mRNAs. For the +cycloheximide sample in Fig. 8, this looks more like a "half-mer" between a monosome and disome, rather than a standard polysome.

      We do not agree with the point raised by the reviewer on sucrose gradients, which we believe stems from a misunderstanding of the experiments conducted. We would like to point out that the plots shown in the manuscript represent the percentage of mRNA transcripts labelled with a radioactive cap that were introduced into cell-free translation extracts. Because we monitor only radioactivity, only the radioactive mRNA transcripts tested in these experiments are detected; consequently, background 80S ribosomes that are not engaged with these mRNAs are not observed. Such background 80S ribosomes are visible on the OD profile now shown in a new Supplementary Figure S6. However, non-engaged 80S ribosomes are not radioactive, and mRNAs that are not engaged in 80S complexes are found in the RNP fraction. The absence of a radioactive background 80S signal is further corroborated by the use of edeine, which prevents the codon-anticodon interaction (see data below).

      When we set up our experimental strategy, we first used edeine to validate our protocol; in this case, no radioactive 80S was observed, confirming that no background 80S is present in our assays. In conclusion, peaks at the level of the 80S can only correspond to radioactive mRNA engaged in an 80S ribosome. We have extended the figure legend to clarify the experiments conducted.

      Concerning Fig 8, we agree that this experiment is not conclusive and propose to remove it as mentioned in response to a comment from reviewer #1.

      3) In Fig. 7, it would be helpful to see the absolute level of translation from the reporters, as it is not clear what the baseline level of translation is in the knockdown cell lines. It's hard to judge the eIF4E knockdown case in particular without this information. Also in panel B, the GGCCC147 cell line is missing.

      As previously mentioned, we agree that Fig 7 is misleading and we have completely remodelled the figure in the revised manuscript. See also point 5 from reviewer #1. Because the GGCCC147 mutation had no effect in RRL, we decided not to test it in HEK cells and focused on the GGCC107 that has a significant effect both in RRL and in HEK cells.

      4) From the MS experiments in Fig. 6 and Supplementary Fig. 6, the authors focus on eIF2D, which makes sense. But they don't comment on two other highly suggestive hits in the a3 vs. beta-globin and a3 vs. a11 comparisons. These are eIF5B and HBS1L. Both are highly suggestive of what might be going on with the eIF2D-dependent translation mechanism. They don't show up in the GMP-PNP samples in Supplementary Fig. 6, which is interesting and deserves a comment.

      We are grateful for this very interesting comment. As suggested, we have inserted a comment related to HBS1L and eIF5B in the discussion of the manuscript.

    1. Author Response

      Reviewer #1 (Public Review):

      The authors have tried to correlate changes in the cellular environment by means of altering temperature, the expression of key cellular factors involved in the viral replication cycle, and small molecules known to affect key viral protein-protein interactions with some physical properties of the liquid condensates of viral origin. The ideas and experiments are extremely interesting as they provide a framework to study viral replication and assembly from a thermodynamic point of view in live cells.

      The major strengths of this article are the extremely thoughtful and detailed experimental approach; although the data collection and analysis were most likely extremely time-consuming, the techniques used here are so simple that the main goal and idea of the article come across elegantly. A second major strength is that, in order to understand some of the physicochemical properties of the viral liquid inclusions, the authors used stimuli that have been very well studied, so one can focus on a relatively straightforward interpretation of most of the data presented here.

      There are three major weaknesses in this article. The first is that the writing, especially at the beginning, is extremely confusing. I would suggest that the authors extensively review the use of English; in particular, the abstract and introduction are extremely hard to understand. In addition, in the abstract and introduction, the authors use terms such as "hardening", "perturbing the type/strength of interactions", "stabilization", and "material properties", to cite just a few. It is clear that the authors know exactly what they are referring to, but the definitions come so late in the text that it all becomes confusing. The second major weakness is the lack of a deep discussion of the physical meaning of some of the measured parameters, such as "C dense vs inclusion" and "nuclear density and supersaturation". The physical consequences of all the graphs need further explanation; most of them are discussed only superficially. The third major weakness is the lack of analysis of phase separation. Some of the data suggest a phase transition and/or phase separation, so a more in-depth analysis is required. For example, could the authors calculate the changes in entropy and enthalpy of some of these processes? Could they find boundaries for these transitions between the "hard" (whatever that means) and liquid states?

      The authors have achieved almost all their goals, with the caveat of the third weakness I mentioned before. The work presented in this article is of significant interest and could become extremely important if a more detailed analysis of the thermodynamic parameters is performed and a better description of the physical phenomenon is provided.

      We thank you for the comments and, in particular, for being so positive regarding the strengths of our manuscript and for raising concerns that will surely improve it. We have taken the following actions to address your concerns:

      1) Extensive revisions have been made to the use of English, particularly in the abstract and introduction. Key terms are defined as they are introduced in the text to enhance the clarity of the argument. This is a significant revision that is highlighted within the text, but it is too extensive to detail here.

      2) In the results section, we improved and extended the discussion of our graphs to the extent possible. However, we found that attempting to explain the graphs' meanings more thoroughly would detract from our manuscript's main focus: identifying thermodynamic changes that could potentially lead to alterations in material properties, specifically aspect ratio, size, and Gibbs free energy. As a result, we introduced the type of information we could obtain from our analyses in the introduction (Lines 112-125) and briefly commented on it in the ‘results’ section (Lines 304-306, sentences below).

      From introduction – lines 112-125:

      “In addition, other parameters like nucleation density determine how many viral condensates are formed per area of cytosol. Overall, the data will inform us if changing one parameter, e.g. the concentration, drives the system towards larger condensates with the same or more stable properties, or towards more abundant condensates that are forced to maintain the initial or a different size on account of available nucleation centres (Riback et al., 2020; Snead, 2022). It will also inform us if liquid viral inclusions behave like a binary or a multi-component system. In a binary mixture, Cdilute is constant (Klosin et al., 2020). However, in multi-component systems, Cdilute increases with bulk concentration (Riback et al., 2020). This type of information could have direct implications for the condensates formed during influenza infection. As the 8 different genomic vRNPs have a similar overall structure, they could, in theory, behave as a binary system between units of vRNPs and Rab11a. However, a change in Cdilute with concentration would mean that the system behaves as a multi-component system. This could raise the hypothesis that the differences in length, RNA sequence and valency of each vRNP may be relevant for the integrity and behaviour of condensates.”

      From results lines 304-306:

      This indicates that the liquid inclusions behave as a multi-component system and allows us to speculate that the differences in length, RNA sequence and valency of each vRNP may be key for the integrity and behaviour of condensates.
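      The binary versus multi-component criterion quoted above (Cdilute constant in a binary mixture, Cdilute rising with bulk concentration in a multi-component system) amounts to asking whether Cdilute has a non-zero slope against bulk concentration. The sketch below illustrates that check with an ordinary least-squares slope; the helper name `cdilute_slope`, the choice of OLS, and the input values are hypothetical and are not the authors' analysis pipeline.

```python
def cdilute_slope(c_bulk, c_dilute):
    """Ordinary least-squares slope of Cdilute against bulk concentration.

    A slope near zero is consistent with a binary mixture (Cdilute fixed);
    a positive slope is consistent with a multi-component system.
    (Hypothetical helper, for illustration only.)
    """
    if len(c_bulk) != len(c_dilute) or len(c_bulk) < 2:
        raise ValueError("need at least two paired measurements")
    n = len(c_bulk)
    mean_x = sum(c_bulk) / n
    mean_y = sum(c_dilute) / n
    sxx = sum((x - mean_x) ** 2 for x in c_bulk)
    if sxx == 0:
        raise ValueError("bulk concentrations must vary")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(c_bulk, c_dilute))
    return sxy / sxx


# Hypothetical per-cell measurements in arbitrary units, not study data.
binary_like = cdilute_slope([1, 2, 3, 4], [5, 5, 5, 5])
multi_like = cdilute_slope([1, 2, 3], [2, 4, 6])
print(binary_like, multi_like)
```

      With real cell-to-cell data the slope will never be exactly zero, so deciding whether it is "flat" would additionally require a significance test or confidence interval on the slope, which this sketch does not include.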

      3) The reviewer has drawn our attention to the absence of phase separation analysis in our study. We believe that the formation of influenza A virus condensates is governed by phase separation (or percolation coupled to phase separation). However, we must exercise caution at this point because the condensates we are studying are highly complex, and our cellular system may not be adequate to claim phase separation without validation in an in vitro reconstitution system. IAV inclusions contain a variety of cellular membranes, different vRNPs, and Rab11a. While we have robust data to propose a model in which the liquid-like properties of IAV inclusions arise from a network of interacting vRNPs that bridge multiple cognate vRNP-Rab11 units on flexible membranes, similar to what occurs in phase-separated vesicles in neurological synapses, our model for this system still lacks formal experimental validation. The data supporting our model include the demonstration of the liquid properties of IAV inclusions (Alenquer et al. 2019, Nature Communications, 10, 1629) and the impairment of recycling endocytic activity during IAV infection (Bhagwat et al. 2020, Nat Commun, 11, 23; Kawaguchi et al. 2012, J Virol, 86, 11086-95; Vale-Costa et al. 2016, J Cell Sci, 129, 1697-710). This impairment leads to aggregated vesicles, seen by correlative light and electron microscopy (Vale-Costa et al., 2016 JCS, 129, 1697-710) and by immunofluorescence and FISH (Amorim et al. 2011, J Virol 85, 4143-4156; Avilov et al. 2012, Vaccine 30, 7411-7417; Chou et al. 2013, PLoS Pathog 9, e1003358; Eisfeld et al. 2011, J Virol 85, 6117-6126; and Lakdawala et al. 2014, PLoS Pathog 10, e1003971).

      To explore the significance of the liquid material properties of IAV inclusions, we used the strategy described in this work. By developing an effective method to manipulate the material properties of IAV inclusions, we provide evidence that controlled phase transitions can be induced, resulting in decreased vRNP dynamics in cells and a negative impact on progeny virion production. This suggests that the liquid character of IAV inclusions is important for their function in infection. We have improved our explanation of this concern in the limitations of our study (as outlined below and in the manuscript in lines 857-872).

      We are currently establishing an in vitro reconstitution system to formally demonstrate, in an independent publication, that IAV inclusions are formed by phase separation (or percolation coupled to phase separation). For this future work, we teamed up with Pablo Sartori, a theoretical physicist, to perform an in-depth analysis of the thermodynamics of the viral liquid condensates in the in vitro reconstituted system and compare it to results obtained in the cell. We think that cells have too many variables to derive meaningful physical parameters (such as entropy and enthalpy) and models on their own, and that these need to be complemented by in vitro systems. For example, increasing the concentration inside a cell is not a simple endeavour, as it relies on cellular pathways to deliver material to a specific place. At the same time, the 8 vRNPs, as mentioned above, differ in size, valency and RNA sequence and may behave very differently in the formation of condensates and the maintenance of their material properties. Ideally, they should be analysed individually or in selected combinations. In the future, we will combine data from in vitro reconstitution systems and cells to address this very important point raised by the reviewer.

      From the paper on the section ‘Limitations of the study’:

      “Understanding condensate biology in living cells is physiologically relevant but complex because the systems are heterotypic and away from equilibrium. This is especially challenging for influenza A liquid inclusions, which are formed by 8 different vRNP complexes that, although sharing the same structure, vary in length, valency, and RNA sequence. In addition, liquid inclusions result from an incompletely understood interactome where vRNPs engage in multiple and distinct intersegment interactions bridging cognate vRNP-Rab11 units on flexible membranes (Chou et al., 2013, Gavazzi et al., 2013, Sugita et al., 2013, Shafiuddin and Boon, 2019, Haralampiev et al., 2020, Le Sage et al., 2020). At present, we lack an in vitro reconstitution system to understand the underlying mechanism governing demixing of vRNP-Rab11a-host membranes from the cytosol. Such an in vitro system would be useful to explore how the different segments independently modulate the material properties of inclusions, explore whether condensates are sites of IAV genome assembly, accurately determine thermodynamic values and thresholds, perform rheological measurements of viscosity and elasticity, and validate our findings. The results could be compared to those obtained in cell systems to derive the thermodynamic principles operating in a complex system away from equilibrium. Using cells to map how liquid inclusions respond to different perturbations reveals how the system adapts in vivo, but has limitations.”

      Reviewer #2 (Public Review):

      During influenza virus infection, newly synthesized viral ribonucleoproteins (vRNPs) form cytosolic condensates, postulated to be viral genome assembly sites and to have liquid properties. vRNP accumulation in liquid viral inclusions requires its association with the cellular protein Rab11a, directly via the viral polymerase subunit PB2. Etibor et al. investigate and compare the contributions of entropy, concentration, and valency/strength/type of interactions to the properties of the vRNP condensates. For this, they subjected infected cells to the following perturbations: temperature variation (4, 37, and 42 °C), the concentration of viral inclusion drivers (vRNPs and Rab11a), and the number or strength of interactions between vRNPs, using nucleozin, a well-characterized vRNP sticker. Lowering the temperature (i.e. decreasing the entropic contribution) leads to a mild growth of condensates that does not significantly impact their stability. Altering the concentration of drivers of IAV inclusions impacts their size but not their material properties. The most spectacular effect on condensates was observed using nucleozin. The drug dramatically stabilizes vRNP inclusions, acting as a condensate hardener. Using a mouse model of influenza infection, the authors provide evidence that the activity of nucleozin is retained in vivo. Finally, using a mass spectrometry approach, they show that the drug affects vRNP solubility in a Rab11a-dependent manner without altering the host proteome profile.

      The data are compelling and support the idea that drugs that affect the material properties of viral condensates could constitute a new family of antiviral molecules, as already described for the respiratory syncytial virus (Risso Ballester et al. Nature. 2021).

      Nevertheless, the study has some limitations. Several of them are mentioned in a dedicated paragraph at the end of the discussion. These include the heterogeneity of the system (vRNPs of different sizes, interactions between viral and cellular partners that are far from understood), which is far from equilibrium, and the absence of minimal in vitro systems that would be useful to further characterize the thermodynamic and material properties of the condensates.

      There are others.

      We thank Reviewer 2 for highlighting specific details that need improving and for raising such interesting questions to validate our findings. We have addressed the comments of Reviewer 2 and performed the experiments described below each point raised.

      1) The concentrations are mostly evaluated using antibodies. This may be correct for Cdilute. However, measurements of Cdense should be viewed with caution, as the antibodies may have some difficulty accessing the interior of the condensates (as already shown in other systems), and this access may depend on some condensate properties (which may evolve during infection). This might induce artifactual trends in some graphs (as seen in panel 2c), which could, in turn, affect the calculation of some thermodynamic parameters.

      The concern about using antibodies to calculate Cdense is valid, and we considered it very important. We addressed it by performing the same analyses using a fluorescently tagged virus that has mNeonGreen fused to the viral polymerase subunit PA (PA-mNeonGreen PR8 virus). Like NP, PA is a component of vRNPs and labels viral inclusions, colocalising with Rab11 when vRNPs are in the cytosol. However, each vRNP contains only one molecule of PA, whereas it contains 37-96 molecules of NP, depending on the size of the vRNP. As predicted, we observed changes in Cdilute, Cdense and nucleation density. However, the values obtained for Gibbs free energy, size and aspect ratio followed the same trend whether viral inclusions were detected with fluorescently tagged vRNPs or by antibody staining, allowing us to validate our conclusion that major changes in Gibbs free energy occur only when the valency/strength of interactions changes, and not with temperature or concentration (Figure 1 below). Given the extent of these data, we show the results here, and in the manuscript we describe the limitations of using antibodies in our study in the section ‘Limitations of the study’ (lines 881-894). Given the importance of the question regarding the pros and cons of the different systems for analysing thermodynamic parameters, we have decided to assess and explore these differences systematically in a future manuscript.
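      The Gibbs free energy values discussed here are derived from Cdense and Cdilute. As a minimal sketch, assuming the commonly used transfer free energy definition ΔG = −RT ln(Cdense/Cdilute) (as in Riback et al., 2020, cited above; we have not confirmed that this is the exact form used in the manuscript), the calculation is shown below. The helper name `transfer_free_energy` is hypothetical. Because ΔG depends only on the ratio Cdense/Cdilute, a uniform scaling of fluorescence intensity cancels out, whereas an antibody that under-labels the interior of large inclusions lowers the apparent Cdense and biases ΔG towards less negative values, which is why the accessibility concern in point 1 matters.

```python
import math

R = 8.314462618  # molar gas constant, J / (mol K)


def transfer_free_energy(c_dense, c_dilute, temp_celsius=37.0):
    """Assumed form: dG = -R * T * ln(Cdense / Cdilute), returned in J/mol.

    Negative values indicate favourable partitioning into the dense phase
    (the condensate). Hypothetical helper, for illustration only.
    """
    if c_dense <= 0 or c_dilute <= 0:
        raise ValueError("concentrations must be positive")
    temp_kelvin = temp_celsius + 273.15
    return -R * temp_kelvin * math.log(c_dense / c_dilute)


# Hypothetical intensities (not study data): a 10-fold enrichment at 37 °C.
print(round(transfer_free_energy(10.0, 1.0) / 1000, 2), "kJ/mol")
```

      The `temp_celsius` argument is where the temperature perturbations (4, 37 and 42 °C) would enter; the concentrations themselves are whatever calibrated or relative intensities the imaging pipeline provides.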

      For more information: the reviewer may ask why we did not use the PA-fluorescent virus in the first place to evaluate inclusion thermodynamics and avoid the problems antibodies may have in accessing the interior of large inclusions. Our answer is that no system is perfect. In the case of the PA-fluorescent virus, the caveats revolve around the fact that the virus is attenuated (Figure 1a below), exhibiting delayed infection, as demonstrated by reduced levels of viral proteins (Figure 1b below). Consistently, it shows differences in the accumulation of vRNPs in the cytosol: viral inclusions form later in infection, and the amount of vRNPs in the cytosol does not reach the levels observed with PR8-WT virus. After their emergence, inclusions behave as in the wild-type virus (PR8-WT), fusing and dividing (Figure 1c below) and displaying liquid properties.

      As the overarching goal of this manuscript is to evaluate the best strategies to harden liquid IAV inclusions, and given that one of the parameters we were testing is concentration, we reasoned that PR8-WT virus was the more appropriate choice for our analyses.

      In conclusion, both systems have caveats that are important to assess systematically, and these differences may shift or alter thermodynamic parameters such as nucleation density, inclusion maturation rate, Cdense and Cdilute, in particular when varying the total concentration. Note that, to validate all our results using the PA-mNeonGreen PR8 virus, we took the delayed kinetics into account and applied our thermodynamic analyses up to 20 hpi rather than 16 hpi.

      However, because of the question raised by this reviewer about the best way to mitigate errors introduced by antibodies, we re-checked all our data. Not only did we compare the data obtained with the attenuated fluorescently tagged virus with our data, but we also compared images acquired as Z-stacks (as used for the concentration and type/strength-of-interaction analyses) with those acquired as 2D images. Our analysis revealed a very good match between antibody staining and the vRNP fluorescent virus when images were acquired as Z-stacks and analysed as Z projections. Therefore, we re-analysed all our temperature thermodynamic data using images acquired as Z-stacks and entirely revised Figure 2. We believe that all these comparisons and analyses have greatly improved the manuscript, and we thank all reviewers for their input.

      Figure 1 – The PA-mNeonGreen virus is attenuated in comparison to the WT virus, and the Gibbs free energy data obtained with it are consistent with analyses of images in which vRNPs were detected by antibody staining. A. Representation of the PA-mNeonGreen virus (PA-mNG; abbreviation: NCR, non-coding region). B. Cells (A549) were transfected with a plasmid encoding mCherry-NP and co-infected with PA-mNeonGreen virus for 16 h, at an MOI of 10. Cells were imaged under time-lapse conditions starting at 16 hpi. White boxes highlight vRNPs/viral inclusions in the cytoplasm in the individual frames. The dashed white and yellow lines mark the cell nucleus and the cell periphery, respectively. The yellow arrows indicate fission/fusion events and movement of vRNPs/viral inclusions. Bar = 10 µm. Bar in insets = 2 µm. C-D. Cells (A549) were infected or mock-infected with PR8 WT or PA-mNG viruses, at a multiplicity of infection (MOI) of 3, for the indicated times. C. Viral production was determined by plaque assay and plotted as plaque forming units (PFU) per milliliter (mL) ± standard error of the mean (SEM). Data are pooled from 2 independent experiments. D. The levels of viral PA, NP and M2 proteins and actin in cell lysates at the indicated time points were determined by western blotting. (E-G) Biophysical calculations in cells infected with the PA-mNeonGreen virus upon altering temperature (at 10 hpi), evaluating the concentration of vRNPs (over a time course) in conditions expressing native amounts of Rab11a or overexpressing low levels of Rab11a, and upon altering the type/strength of vRNP interactions by adding nucleozin at 10 hpi for the indicated time periods. All data (Ccytoplasm/Cnucleus, Cdense, Cdilute, area, aspect ratio and Gibbs free energy) are represented as boxplots. Above each boxplot, the same letters indicate no significant difference, while different letters indicate statistical significance at α = 0.05 using one-way ANOVA followed by Tukey multiple comparisons of means for parametric analysis, or the Kruskal-Wallis test with Bonferroni correction for non-parametric analysis.

      2) Although the authors have demonstrated that vRNP condensates exhibit several key characteristics of liquid condensates (they fuse and divide, they dissolve upon hypotonic shock or upon incubation with 1,6-hexanediol, FRAP experiments are consistent with a liquid nature), their aspect ratio (with a median above 1.4) is much higher than the aspect ratio observed for other cellular or viral liquid compartments. This is intriguing and might be discussed.

      IAV inclusions have been shown to interact with microtubules and the endoplasmic reticulum, which confers movement, and to undergo fusion and fission events. We propose that these interactions and movement impose forces that deform inclusions, making them less spherical. To validate this assumption, we compared the aspect ratio of viral inclusions in the absence and presence of nocodazole (which abrogates microtubule-based movement). The data in Figure 2 show that, in the presence of nocodazole, the aspect ratio decreases from 1.42 ± 0.36 to 1.26 ± 0.17, supporting our assumption.

      Figure 2 – Treatment with nocodazole reduces the aspect ratio of influenza A virus inclusions. Cells (A549) were infected with PR8 WT for 8 h and treated with nocodazole (10 µg/mL) for 2 h, after which the movement of influenza A virus inclusions was captured by live cell imaging. Viral inclusions were segmented, and their aspect ratio was measured in ImageJ, then analysed and plotted in R.
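      Aspect ratio here is the ratio of the major to the minor axis of an ellipse fitted to each segmented inclusion (ImageJ reports this as AR). The sketch below computes it from the second moments of a set of pixel coordinates, which is similar in spirit to a moment-based ellipse fit but is not guaranteed to reproduce ImageJ's values exactly; the helper name `aspect_ratio` and the toy pixel sets are hypothetical.

```python
import math


def aspect_ratio(pixels):
    """Major/minor axis ratio of the moment-equivalent ellipse of a pixel set.

    `pixels` is a list of (x, y) coordinates belonging to one segmented object.
    A circle (or a square) gives 1.0; elongated objects give values > 1.
    Hypothetical helper, for illustration only.
    """
    n = len(pixels)
    if n < 3:
        raise ValueError("need at least three pixels")
    mean_x = sum(x for x, _ in pixels) / n
    mean_y = sum(y for _, y in pixels) / n
    # Central second moments (covariance matrix of pixel positions).
    sxx = sum((x - mean_x) ** 2 for x, _ in pixels) / n
    syy = sum((y - mean_y) ** 2 for _, y in pixels) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pixels) / n
    # Eigenvalues of [[sxx, sxy], [sxy, syy]]; axis lengths scale with
    # the square roots of these eigenvalues.
    half_trace = (sxx + syy) / 2
    disc = math.sqrt(max(half_trace ** 2 - (sxx * syy - sxy ** 2), 0.0))
    major, minor = half_trace + disc, half_trace - disc
    if minor <= 0:
        raise ValueError("degenerate object (all pixels on one line)")
    return math.sqrt(major / minor)


# Toy shapes: a 10 x 10 square and a 20 x 10 bar of pixels.
square = [(x, y) for x in range(10) for y in range(10)]
bar = [(x, y) for x in range(20) for y in range(10)]
print(aspect_ratio(square), round(aspect_ratio(bar), 3))
```

      Because it is built from position variances, this measure is insensitive to small boundary irregularities but sensitive to stretching along one axis, which is the kind of deformation that microtubule-driven movement is proposed to cause.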

      3) Similarly, the fusion event presented at the bottom of figure 3I is dubious. It might as well be an aggregation of condensates without fusion.

      We have changed this (see Fig. 5A and B in the manuscript); thank you for the suggestion.

      4) The authors could have more systematically performed FRAP/FLAPh experiments on cells expressing fluorescent versions of both NP and Rab11a to investigate the influence of condensate size, time after infection, or global concentrations of Rab11a in the cell (using the total fluorescence of overexpressed GFP-Rab11a as a proxy) on condensate properties.

      We have included a new figure, figure 5 with the suggested data.

    1. Author Response

      Reviewer #1 (Public Review):

      This paper evaluates the effect of knocking out CST7 (Cystatin F) in the APPNL-G-F Alzheimer's disease mouse model. They found sexually dimorphic outcomes, with differential transcriptional responses, increased phagocytosis (but, interestingly, a higher plaque burden) in females and suppressed inflammatory microglial activation in males (but, interestingly, no change in plaque burden). This study offers new insight into the functional role of CST7, which is upregulated in a subset of disease-associated microglia in AD models and human brain. Despite the discovery of disease-associated microglia several years ago, there has been little effort to understand the function of the different genes that make up this profile, making this paper especially timely. Overall, the experiments are well controlled and the data support the main conclusions; the manuscript could be strengthened by addressing the comments below and clarifying questions that could impact the interpretation of the data/findings.

      1) In the first section discussing CST7 expression levels in AD models, it would be good to include a discussion of CST7 changes in human AD samples. There are sufficient available datasets to look at this, and it would help us understand how comparable the animal models are to human patients. For example, while in mice CST7 is highly enriched in microglia/macrophages, in human datasets it does not seem to be as specific to microglia; it is equally expressed in endothelial cells. This might have a significant impact on the interpretation of the data, and it would be good to introduce and assess the findings in mice through the lens of human data. There is a discussion of the human data in the discussion section, but these data would be more appropriately assessed in the same way as the mouse data and presented comparatively in the results section. The authors could also include the data from Gerrits et al. 2021 in their first figure.

      We agree with the reviewer on the importance of considering the work in the context of human disease. While CST7 is not as strongly upregulated in human AD brain as it is in mouse, its expression is observed predominantly in myeloid cells in the brain, with very minimal expression detected in endothelial cells (see screenshots in Author response image 1 from the Brain Myeloid Landscape platform, http://research-pub.gene.com/BrainMyeloidLandscape/BrainMyeloidLandscape2/), and it is enriched in AD clusters vs homeostatic clusters in scRNASeq studies (Gerrits et al., 2021). We attempted immunostaining for human CF (CST7) in AD brains to assess expression and co-localisation with microglial markers but failed to validate any of the antibodies tested. Additionally, King et al., 2023 (PMID: 36547260) recently showed an increase in CST7 expression in bulk hippocampal RNASeq in AD vs mid-life controls, suggesting an ageing/AD mechanism. CST7 has also been shown to be expressed following overexpression of TREM2 in human microglia in vitro, and siRNA-mediated knockdown of its expression leads to an increase in phagocytosis (Popescu et al., 2023 - PMID: 36480007), mirroring our data and suggesting a conserved role in human cells. Overall, we believe that, even in the context of mouse models, understanding the function of genes upregulated in disease is important for the field, and that this study paves the way for further work investigating human CST7 in disease. We have added this (with citations to the datasets mentioned) to the discussion (highlighted).

      Author response image 1

      2) The differential RNAseq data are perhaps one of the most striking results of this paper; however, it is difficult to see exactly how similar the male vs female APPNL-G-F profiles are, as well as which genes are or are not shared in the KO condition. Venn diagrams, in addition to statistical tests, would enhance this part of the paper and add more clarity.

      We have added Venn diagrams showing DEGs between male and female AppNL-G-F microglia vs WT controls, to show how similar the male and female AppNL-G-F profiles are. Additionally, to exemplify the Cst7KO-Sex interaction, we added a Venn diagram showing DEGs between male and female AppNL-G-F microglia vs. AppNL-G-F/Cst7-/- microglia (Fig. 2 – Fig. supplement 3). We confirm that we derived all reported differential gene expression changes (including those represented in the Venn diagrams) using appropriate Padj statistical approaches (see Methods).
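      A two-set Venn diagram of DEGs reduces to set operations on the genes passing an adjusted p-value cutoff in each comparison. The sketch below shows that step under stated assumptions: the helper `deg_overlap`, the 0.05 cutoff, and the gene names and values are hypothetical placeholders, not the study's data or code. Genes with a missing padj (for example, genes removed by independent filtering in DESeq2, which reports NA for them) are treated as not significant.

```python
def deg_overlap(padj_a, padj_b, cutoff=0.05):
    """Split genes called significant (padj < cutoff) in two comparisons
    into the three regions of a two-set Venn diagram.

    padj_a, padj_b: dicts mapping gene -> adjusted p-value (or None).
    Hypothetical helper, for illustration only.
    """
    degs_a = {g for g, p in padj_a.items() if p is not None and p < cutoff}
    degs_b = {g for g, p in padj_b.items() if p is not None and p < cutoff}
    return {
        "only_a": degs_a - degs_b,
        "shared": degs_a & degs_b,
        "only_b": degs_b - degs_a,
    }


# Placeholder adjusted p-values (not study data).
male = {"GeneA": 0.001, "GeneB": 0.20, "GeneC": 0.01, "GeneE": None}
female = {"GeneA": 0.03, "GeneB": 0.004, "GeneD": 0.02}
regions = deg_overlap(male, female)
print({region: sorted(genes) for region, genes in regions.items()})
```

      A plain Venn diagram only captures gene identity; since the response elsewhere notes overlapping genes moving in different directions between sexes, a sign-aware variant would further split the shared set by the direction of the log2 fold change.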

      3) A major argument in the paper is a continuation of Sala-Frigerio 2019, which says that the female phenotype is an acceleration of the male phenotype. Does this mean that if males were assessed at later timepoints, they would be more similar to the females? Or are there intrinsic differences that never resolve? It would be helpful to see a later timepoint for males to distinguish between these two options.

      This is an interesting question, and while we acknowledge that addressing it empirically with a later timepoint could add insight, we believe it would actually need multiple closely spaced timepoints, as the optimal single later timepoint is difficult (and likely impossible) to judge, for the reasons below. We also believe that previously published data, combined with our observations, show that a cell-intrinsic effect most likely explains our sex-specific differences.

      First, we emphasize that the previously published acceleration of the microglial phenotype in female AppNL-G-F mice is fairly subtle and relative rather than absolute; e.g. the DAM/ARM microglial state represents ~50% of all microglia in males and ~55% in females at 12 months old, so both sexes have similarly abundant microglia in the state that most highly expresses Cst7. Indeed, after the age at which DAM/ARM-state microglia appear in appreciable numbers (~6 months), both females and males have an abundance of them. It is important to note that a 12-month-old male is far more “progressed” than a 6-month-old female; hence the stepped age effect is temporally short.

      Second, Cst7 deletion in AppNL-G-F mice caused qualitative differences, affecting distinct genes and/or moving overlapping genes in different directions between female and male mice. If a stepped age effect explained sex differences from Cst7 deletion, given that it could only be stepped by a very short timeframe (several weeks maximum, from the reasoning above), we would expect to see similar qualitative changes, but of different magnitude, in female and male mice arising from Cst7 deletion; this is not the pattern we see.

      Third, beyond 12 months old, regression from ARM/DAM actually occurs, again making it unlikely males would “catch up” with females to show the same profile from Cst7 deletion but just at an older age – practically, this also complicates choosing a single later timepoint (and age-related systemic morbidity emerges as a potential confounder as well).

      In summary, while the acceleration of the DAM signature in female microglia offers an intriguing possible explanation for our observation of sexual dimorphism in response to deletion of one of the key genes in this signature, we believe it more likely that intrinsic effects are responsible for the sex-related impact of Cst7 deletion. Taking the alternative perspective, even if a stepped age effect in the underlying progression of the model could explain our findings, exposing this pattern would need multiple timepoints with short gaps between them (e.g. monthly at 12, 13, 14 and 15 months old) to provide sufficient temporal resolution; we would not have the resources to conduct such a resource-intensive and lengthy study. We hope this reasoning is clear; conscious of the importance of conveying it in our manuscript, we have revised the Discussion to capture the key points outlined above as concisely as possible.

      4) If the central argument is that CST7 in females decreases phagocytosis and in males increases microglial activation, are there changes in amyloid plaque burden or structure in the APPNL-G-F/CST7 KO mice compared to APPNL-G-F/CST7 WT mice that reflect these changes? Please address. If not, how does this affect the functional interpretation of the differential expression observed in phagocytic/reactive microglia genes? Pieces of this are discussed, but it could be clearer.

      We emphasise the data already presented in Fig 6 and Fig. 6 – Fig. Supplement 2 showing altered Aβ burden (6E10 staining) and plaque count (MeX04) but no change in plaque area. Regarding the functional interpretation of Cst7-dependent gene changes in microglia beyond the endolysosomal function we present in figures 3-5, we have included additional data using simple immunohistochemistry, as suggested by the reviewer, to assess synapse abundance. We show loss of Sy38 coverage around plaques (Fig. 6I) and a moderate but significant decrease in coverage between AppNL-G-F/Cst7-/- vs AppNL-G-F brains only in females (Fig. 6J). This reflects the effect observed with plaque coverage whereby we observe increased burden in AppNL-G-F/Cst7-/- vs AppNL-G-F females but not males (Fig. 6B-F) suggesting the increased plaque burden in Cst7-/- female mice may lead to increased synapse loss. We would also emphasise that altered expression of phagolysosomal genes could affect disease in ways beyond interactions with amyloid and synapses.

      5) It is confusing that increased phagocytosis in the APPNL-G-F/CST7 KO females leads to greater plaque burden, considering proteolysis is not affected. What might explain this observation? Additionally, it is interesting that suppression of microglial activation doesn't lead to an increase in plaques in the male APPNL-G-F/CST7 KO mice. How does the profile of phagocytic microglia in the male APPNL-G-F/CST7 KO mice differ from the APPNL-G-F males?

      We emphasize our comments on this topic in the discussion, where we speculate that the greater plaque burden in females is linked to increased uptake of Aβ (which we observe in Fig. 4B&C) and deposition into plaques, as suggested by Huang et al., 2021 (PMID: 33859405), d’Errico et al., 2022 (PMID: 34811521) and Shabestari et al., 2022 (PMID: 35705056). Regarding the lack of effect in males despite the suppression of inflammatory genes, we agree this is a curious observation, although it may point to as yet ill-defined mechanisms by which inflammatory pathways influence plaque pathology. Unfortunately, we were not able to specifically compare the profile of phagocytic microglia in AppNL-G-F vs AppNL-G-F/Cst7-/- mice, as we did not perform single-cell RNASeq. However, our bulk RNASeq profiling suggests modest downregulation of phagocytic/endolysosomal genes (e.g. Lilrb4a, Fig. 2I), and immunostaining shows reduced expression of LAMP2 in microglia. We have added further comment on this in the discussion.

      6) It seems that the authors have potentially discovered an unusual mechanism by which CST7 could regulate cell-autonomous function without impacting its canonical protease target. The authors deal with this extensively in the discussion, but an ELISA or ICC to localize CST7 to microglia in vitro or in vivo would help address this point.

      We have added FISH data localising Cst7 expression to IBA1+ cells specifically around plaques in App brains (Fig. 1B-E). We agree that assessing the subcellular localisation and any non-microglial expression of Cystatin-F (the protein encoded by Cst7) would offer valuable insight into the protease target and may reveal details of the precise mechanism by which CF deletion leads to the phenotype we observe in this study. However, despite testing numerous commercially available and gifted antibodies to detect CF, we were unable to validate (using Cst7-/- tissue as controls) any method other than FISH.

      7) The authors focus on plaques in their final figure, however dysregulated microglial phagocytosis could impact many other aspects of brain health. Simple immunohistochemistry for synapses and myelin/oligodendrocytes (especially given the results of the in vitro phagocytosis assay) could provide more insight here.

      We fully agree with the reviewer. As also outlined in our responses elsewhere, phagocytic changes could have multiple consequences, and we have included additional data using immunohistochemistry as advised for synapses in WT, AppNL-G-F, and AppNL-G-F/Cst7-/- brains. We show loss of Sy38 coverage around plaques (Fig. 6I) and a moderate but significant decrease in coverage between AppNL-G-F/Cst7-/- vs AppNL-G-F brains only in females (Fig. 6J). This reflects the effect observed with plaque coverage whereby we observe increased burden in AppNL-G-F/Cst7-/- vs AppNL-G-F females but not males (Fig. 6B-F) suggesting the increased plaque burden in Cst7-/- female mice may lead to increased synapse loss.

      We also performed immunohistochemistry for the myelin markers MAG and MBP but found no plaque-associated pathology. Finally, we searched for dystrophic neurites using LAMP1 but found that, in this model, the antibody stained microglial lysosomes rather than dystrophic neurites (see Author response image 2), an observation also made by others (Sharoar et al., 2021 - PMID: 34215298).

      Overall, our data suggest Cst7 may play a protective role in females, limiting phagocytosis, reducing plaque burden and blunting synapse loss.

      Author response image 2.

      Reviewer #3 (Public Review):

      In this manuscript, Daniels et al explored the role of Cystatin F in an Aβ-driven mouse model of Alzheimer's disease. By crossing a constitutive knockout mouse lacking the gene that encodes Cystatin F, Cst7, to the AppNL-G-F mouse line, the authors describe impairments in microglial gene expression and phagocytic function that emerge more prominently in females versus males lacking Cst7. A strength of the study is its focus: given mounting evidence that microglia are a hub of neurological dysfunction with particular potential to trigger or exacerbate neurodegenerative disorders, it is essential to determine the changes in microglia that occur pathologically to promote disease progression. Similarly, the widespread identification of the gene in question, Cst7, as upregulated in AD models makes this gene a good target for mechanistic studies.

      The paper in its current form also has several weaknesses which limit the insights derived, weaknesses that are largely related to the experimental tools and approaches chosen by the authors to test their hypotheses. For example, the paper begins with a figure replotting data from previous studies showing that Cst7 is upregulated in mouse models of Alzheimer's disease. Though relevant to the current study, there are no new insights provided here. Next, the authors perform bulk RNA-sequencing on microglia isolated from male and female mice in the Cst7-/-; AppNL-G-F mouse line. In the methods, it is unclear whether the authors took precautions to preserve the endogenous transcriptional state of these cells given evidence that microglia can acquire a DAM-like signature simply due to the process of dissociation (Marsh et al, Nature Neuroscience, 2022). If the authors did not control for this, their results may not support the conclusions they draw from the data. Relatedly, it appears the authors pooled all microglia together here, instead of just isolating DAMs specifically or analyzing microglia at single-cell resolution, which could reveal the heterogeneous nature of the role of Cst7 in microglia. In addition to losing information about heterogeneity, another concern is that they could be diluting out the major effects of the model on microglial function by including all microglia. Overall, the biggest issue I have with the RNA-sequencing data is the lack of validation of the gene expression changes identified using a different method that does not require dissociation, like immunohistochemistry or fluorescence in situ hybridization. Especially given the limited number of genes they found to be mis-regulated (see Fig. 2 E and G), I worry that these changes might simply be noise, especially since the authors provide no further evidence of their mis-regulation. Without further validation, the data presented are not sufficient to support the authors' claims.

      We believe we have addressed this comment in the “Essential Revisions (for the authors)” section above. Please see again below:

      We took standard precautions to minimise the risk of aberrant ex vivo cell activation, including keeping cells on ice during non-enzymatic steps of the procedure and carrying out preps in small batches to minimise the time from removal of the brain to purification of microglial RNA. Importantly, we also validated key expression data by in situ methods such as RNA FISH for Cst7 and Lilrb4a (Fig. 1B-E, Fig. 2 – Fig. supplement 3), which are not subject to dissociation-induced effects. Additionally, when performing qPCR on microglia from non-disease mice to test the disease-specific role of Cst7-dependent gene regulation, we did not observe the same gene changes (Fig. 2 – Fig. supplement 4); if such changes were dependent on tissue dissociation, we would expect to observe them in WT as well as disease animals. We used the resources provided by Marsh et al. 2022 to search for overlap between enzyme-induced genes and the DEG lists from our key comparisons. The enzyme-induced gene set overlapped minimally with all of our comparisons: only 4 genes were shared between enzyme-induced genes and Cst7-dependent genes in males, and none between enzyme-induced genes and Cst7-dependent genes in females. We would further point out that the disease-induced microglial RNAseq profile in the AppNL-G-F Cst7+/+ (i.e. disease WT) condition mirrors those observed previously by multiple methods, including in situ profiling (Zeng et al 2023 - PMID: 36732642) and RiboTag approaches (Kang et al 2018 - PMID: 30082275). We believe these combined approaches provide convincing validation of the RNAseq data.
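      The overlap check described above amounts to a set intersection between two gene lists. A minimal Python sketch, using hypothetical placeholder gene symbols rather than the actual Marsh et al. 2022 enzyme-induced list or our DEG lists:

```python
def gene_overlap(reference_genes, deg_genes):
    """Return the shared gene symbols (sorted) and the fraction of DEGs they represent."""
    shared = sorted(set(reference_genes) & set(deg_genes))
    fraction = len(shared) / len(set(deg_genes)) if deg_genes else 0.0
    return shared, fraction


# Hypothetical placeholder symbols, not data from this study.
enzyme_induced = ["GeneA", "GeneB", "GeneC", "GeneD"]
cst7_dependent = ["GeneC", "GeneD", "GeneE", "GeneF"]

shared, fraction = gene_overlap(enzyme_induced, cst7_dependent)
print(shared, fraction)  # ['GeneC', 'GeneD'] 0.5
```

      In practice, gene symbols would need to be harmonised (case, aliases, species) before intersecting, otherwise genuine overlaps can be missed.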

      In assessing the changes in microglial function and Aβ pathology that occur in males and females of the Cst7-/-; AppNL-G-F line, the authors identify some differences between how females and males are affected by the loss of Cst7. While the statistical analyses the authors perform as given in the figure legends appear to be correct, the plots do not show significant changes between males and females for a given parameter. Take for example Figure 3H. Loss of Cst7 decreases IBA+Lamp+ microglia in males but increases this parameter in females. However, it does not appear that there is a significant difference in IBA+Lamp+ microglia in male versus female mice lacking Cst7. If there is no absolute difference between males and females, can the differential effects of Cst7 knockout on the sexes really be so relevant to the sexual dimorphism observed in the disease? I question this connection, but perhaps a greater discussion by the authors of what the result might mean would be helpful for placing this into context.

      We understand the reviewer’s perspective and we agree that the interpretations could be presented and explained better in the text - we have updated the discussion as suggested to address this.

      We designed our study initially to search for sex-specific effects of Cst7. Therefore, whilst our ANOVA does include main effects analysis for disease or sex, we carried out post-hoc analysis primarily to investigate effects of Cst7 deletion within sex. In the case of Fig. 3H pointed out by the reviewer, we observe a main effect of disease in the ANOVA and a disease-sex interaction, but no main effect of sex. Post-hoc analysis revealed the sex-specific effects of Cst7 we describe in the manuscript. The same analytical approach was taken by Hoghooghi et al. (2020 - PMID: 33027652), who show that the related pathway gene Cstc is detrimental in EAE in females but not males (included in the discussion of this manuscript). The observation in Fig. 3H that there appears to be a Cst7 effect in males and females but no sex effect in Cst7-/- animals is accurate, but is a relative anomaly in this study. Generally, we find that, alongside Cst7 deletion affecting females differently from males, we also see a sex effect in Cst7-/- animals but not in Cst7+/+ animals, i.e. absolute levels in the disease condition, as well as relative changes from control to disease condition, differ between males and females. This is exemplified in Fig. 4B&C, where we observe increased microglial Aβ in female vs male Cst7-/- animals, and in Fig. 6D, where we observe increased Aβ plaque burden in female vs male Cst7-/- animals. This is most strikingly demonstrated by our RNASeq data, where we observe a difference in sex-dependent genes in AppNL-G-F vs AppNL-G-F/Cst7-/- (Fig. 2 – Fig. supplement 3B), implying that removal of the Cst7 gene led to an ‘unlocking’ of sexual dimorphism in our cohort, which we comment on in the discussion.
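      The factorial analysis described above can be summarised as follows. This is a sketch assuming a standard two-way fixed-effects ANOVA (the exact model specification and post-hoc test are not restated here), with a condition factor i (e.g. disease status or Cst7 genotype, depending on the comparison) and sex j:

```latex
y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk},
\qquad \varepsilon_{ijk} \sim \mathcal{N}(0, \sigma^2)
```

      Here \alpha_i and \beta_j are the main effects of condition and sex, (\alpha\beta)_{ij} is their interaction, and k indexes animals. A significant interaction without a main effect of sex is consistent with a condition effect that differs in direction between the sexes, as in Fig. 3H; the within-sex post-hoc comparisons then test contrasts between two condition levels a and b, \bar{y}_{aj\cdot} - \bar{y}_{bj\cdot}, separately for each sex j.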

      Finally, the use of in vitro assays of microglial function can be helpful as secondary analyses when coupled with in vivo or ex vivo approaches, but are not on their own sufficient to support the authors' conclusions. Quantitative engulfment assays (see Schafer et al, Neuron, 2012) on brain tissue showing that male and female microglia lacking Cst7 engulf different amounts of material (e.g. plaques, synapses, myelin) in the intact brain would be more convincing.

      We agree that in vitro assays of microglial function are not always sufficient as standalone methods to support conclusions about function in disease. The reviewer may have missed our in vivo MeX04 uptake assays (Fig. 4A-D): MeX04 is injected pre-mortem and microglia are then isolated and measured by flow cytometry, so the readout reflects microglial uptake in vivo. This experiment showed increased microglial Aβ in female Cst7-/- animals vs male Cst7-/- animals (Fig. 4B&C). Our in vitro assays complement and extend these findings in ways not possible in vivo; for example, they offer key insight into uptake/degradation kinetics that would be extremely challenging to measure in vivo.

      In general, a major limitation to the insights that can be derived in the study is the decision of the authors to perform all experiments at a single late-stage time point of 12 months of age. As this is quite far into disease progression for many AD models, phenotypic changes identified by the authors could arise due to the downstream effects of plaque deposition and therefore may not implicate Cst7 as a mechanism driving neurodegeneration rather than one of many inflammatory changes that accompany AD mouse models nearing the one-year time point. A related problem is that the study uses a constitutive KO mouse that has lacked Cst7 expression throughout life, not just during disease processes that increase with aging. In summary, the topic of the article is important and timely, but the connection between the data and the authors' conclusions is not as strong as it could be.

      As described above, Cst7 expression is absent at steady-state and low until 6-12 months. Therefore, we predict that deletion would have little effect until 12+ months whereby cells expressing Cst7 have had the temporal window to affect disease pathology, as we find in the current study. This was a key part of the reasoning in our choice of the 12-month age for analyses. The negligible expression of Cst7 at baseline/early stages of disease suggests constitutive KO of the gene will not impact the phenotype until disease onset. This is substantiated by the lack of any genotype-related differences in the WT vs Cst7-/- comparisons in the non-disease condition.

    1. Author Response:

      Reviewer #1:

      The paper uses a microfluidics-based method of cell volume measurement to examine single-cell volume dynamics during cell spreading and osmotic shocks. The paper successfully shows that cell volume is largely maintained during cell spreading, but that small volume changes depend on the rate of cell deformation during spreading and on cell ionic homeostasis. Specifically, the major conclusion that there is a mechano-osmotic coupling between cell shape and cell osmotic regulation is, I think, correct. Moreover, the observation that faster-deforming cells have larger volume changes is informative.

      The authors examined a large number of conditions and variables. It's a paper rich in data and general insights. The detailed mathematical model, and specific conclusions regarding the roles of ion channels and cytoskeleton, I believe, could be improved with further considerations.

      We thank the referee for the nice comment on our work and for the detailed suggestions for improving it.

      Major points of consideration are below.

      1) It would be very helpful if there is a discussion or validation of the FXm method accuracy. During spreading, the cell volume change is at most 10%. Is the method sufficiently accurate to consider 5-10% change? Some discussion about this would be useful for the reader.

      This is an important point and we are sorry if it was not made clear in our initial manuscript. We have now made it more clear in the text (p. 4 and Figure S1E and S1F).

      The important point is that the absolute accuracy of the volume measure is indeed in the 5 to 10% range, but the relative precision (repeated measures on the same cell) is much higher, rather in the 1% range, as detailed below based on experimental measures.

      1) Accuracy of absolute volume measurements. The accuracy of the absolute volume measurement depends on several parameters that can vary from one experiment to another: the exact height of the chamber, and the biological variability from one batch of cells to another (we found that the distribution of volumes in a population of cultured cells depends strongly on the details of the culture, such as seeding density and substrate, which we normalized as much as possible to reduce this variability, as described in previous articles, e.g. [2]). To estimate this variability overall, the simplest approach is to compare the average volume of the cell population in different experiments, carried out in different chambers and on different days.

      Graph showing the initial average volume of cells +/- STD for 7 spreading experiments and 27 osmotic shock experiments, expressed as a % deviation from the average volume over all the experiments.

      The average deviation is 10.9 ± 8%.

      2) Precision of relative volume measurements. When the same cell is imaged several times in a time-lapse experiment, as it is spreading on a substrate, or as it is swelling or shrinking during an osmotic shock, most of the variability occurring from one experiment to another does not apply. To experimentally assess the precision of the measure, we performed high time resolution (one image every 30 ms) volume measurements of 44 spread cells during 9 s. During this period of time, the volume of the cell should not change significantly, thus giving the precision of the measure.

      Graph showing the coefficient of variation of the volume (STD/mean) for each individual cell (n=44) across the almost 300 frames of the movie. This shows that on average the precision of volume measurements for the same cell is 0.97±0.21%. In addition, if more precision was needed, averaging several consecutive measures can further reduce the noise, a method which is very commonly used but that we did not have to apply to our dataset.
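      The two quantities above (the deviation of experiment means from the grand mean, and the per-cell coefficient of variation) can be computed as in the minimal Python sketch below. It uses a synthetic trace, not our measurements: it assumes a constant true volume (normalised to 1), 300 frames (roughly 9 s at 30 ms per frame) and an arbitrary 1% Gaussian noise per frame, and illustrates that averaging k consecutive frames reduces the frame-to-frame scatter by roughly a factor of sqrt(k).

```python
import random
import statistics


def percent_deviation_from_grand_mean(experiment_means):
    """Absolute % deviation of each experiment's mean volume from the mean over all experiments."""
    grand_mean = statistics.fmean(experiment_means)
    return [abs(m - grand_mean) / grand_mean * 100.0 for m in experiment_means]


def coefficient_of_variation(series):
    """Sample STD / mean of repeated volume measurements of one cell, in percent."""
    return statistics.stdev(series) / statistics.fmean(series) * 100.0


def block_average(series, k):
    """Average non-overlapping blocks of k consecutive frames (any remainder is dropped)."""
    n_blocks = len(series) // k
    return [statistics.fmean(series[i * k:(i + 1) * k]) for i in range(n_blocks)]


# Synthetic trace, not measured data: constant normalised volume 1.0,
# 300 frames, assumed 1 % Gaussian noise per frame.
rng = random.Random(0)
trace = [1.0 + rng.gauss(0.0, 0.01) for _ in range(300)]

print(f"raw CV: {coefficient_of_variation(trace):.2f} %")
print(f"CV after 5-frame averaging: {coefficient_of_variation(block_average(trace, 5)):.2f} %")
```

      percent_deviation_from_grand_mean applies to per-experiment population means (absolute accuracy), whereas coefficient_of_variation applies to repeated measures of a single cell (relative precision).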

      We have included these results in the revised manuscript, since they might help the reader to estimate what can be obtained from this method of volume measurement. We also point the reviewer to previous research articles using this method and showing both population averages and time-lapse data [2–8]. Another validation of our volume measurement method comes from the relative volume changes in response to osmotic shock (Ponder’s relation) measured with FXm, which gave results very similar to previously published values. We actually performed these experiments to validate our method, since the results are not novel.
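      For reference, Ponder’s relation is commonly written as a linear dependence of relative volume on inverse relative osmolarity, V/V0 = R·(Π0/Π) + (1 − R), where Ponder’s factor R is the slope fitted for a given cell type. The Python sketch below evaluates this relation and fits R by least squares with the line forced through (1, 1); the value R = 0.7 in the example is hypothetical, not our fitted value.

```python
def relative_volume(osm_ratio, ponder_r):
    """Ponder's relation: V/V0 = R * (Pi0/Pi) + (1 - R).

    osm_ratio is Pi0/Pi (initial over final external osmolarity);
    ponder_r is the fitted slope (osmotically active fraction).
    """
    return ponder_r * osm_ratio + (1.0 - ponder_r)


def fit_ponder_r(osm_ratios, rel_volumes):
    """Least-squares slope of (V/V0 - 1) against (Pi0/Pi - 1), i.e. a line through (1, 1)."""
    num = sum((x - 1.0) * (y - 1.0) for x, y in zip(osm_ratios, rel_volumes))
    den = sum((x - 1.0) ** 2 for x in osm_ratios)
    return num / den


# Hyperosmotic shock from 300 to 450 mOsm with a hypothetical R = 0.7.
print(relative_volume(300 / 450, 0.7))
```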

      2) The role of cell active contraction (myosin dynamics) is completely neglected. The membrane tether tension results, LatA and Y-compound results all indicate that there is a large influence of myosin contraction during cell spreading. I think most would not be surprised by this. But the model has no contribution from cortical/cytoskeletal active stress. The authors are correct that the osmotic pressure is much larger than hydraulic pressure, which is related to active contraction. But near steady state volume, the osmotic pressure difference must be equal to hydraulic pressure difference, as demanded by thermodynamics. Therefore, near equilibrium they must be close to each other in magnitude. During cell spreading, water dynamics is near equilibrium (given the magnitude of volume change), and therefore is it conceptually correct to neglect myosin active contraction? BTW, 1 solute model does not imply equal osmolarity between cytoplasm and external media. 1 solute model with active contraction was considered before, e.g., ref. 17 and Tao, et al, Biophys. J. 2015, and the steady state solution gives hydraulic pressure difference equal to osmotic pressure difference.

      This is an excellent point raised by the referee. We have two types of answers. First, an experimental answer, which shows that acto-myosin contractility does not seem to play a direct role in the control of cell volume, at least in the cells we used here. Based on these results, we then propose a theoretical reason why this is the case. It contrasts with the view proposed in the articles mentioned by the referee for a reason that stems not from the physical principles, with which we fully agree, but from the actual numbers available in the literature for the amounts of the various types of osmolytes inside the cell. We give these points in more detail below and we hope they will convince the referee. We also now mention them explicitly in the main text of the article (p. 6-7, Figure S3F) and in the Supplementary file with the model.

      A. Experimental results

      To test the effect of acto-myosin contraction on cell volume, we performed two experiments:

      1) We measured the volume of the same cell before and after treatment with the ROCK (Rho kinase) inhibitor Y-27632, which decreases cortical contractility. As in the osmotic shock experiments, cells were plated on poly-L-lysine (PLL)-coated glass, a substrate to which cells adhere, allowing the solution to be exchanged, but on which they do not spread and remain rounded. This allowed us to evaluate the effect of the drug. The change of medium itself (with control medium) induced a volume change of less than 2%, similar to control osmotic shock experiments (possibly due to shear stress). When the cells were treated with Y-27, the volume change was similar to that with the control medium (now commented in the text p. 6-7, Figure S3F). To make the analysis more complete, we distinguished the cells that remained round throughout the experiment from the cells that spread slightly, since spreading could have an effect on volume. Indeed, we observed that treatment with Y-27 induced more cells to spread (Figure S3F), probably because the cortex was less tense, allowing the adhesive forces on PLL to induce more spreading [9]. Nevertheless, the spreading remained rather slow and the volume change of cells treated or not with Y-27 was not significantly different. This shows that, in the absence of fast spreading, the reduction of contractility by Y-27 per se does not have any effect on cell volume.

      Graphs showing proportion of cells that spread during the experiments (left); average relative volume of round (middle) and spread (right) control (N=3, n=77) and Y-27 treated cells (N=4, n=297).

      2) To evaluate the impact of a reduction of contractility in the total absence of adhesion, we measured the average volume of control cells versus cells which have been pretreated with Y-27, plated on a non-adhesive substrate (PLL-PEG treatment). This experiment showed that the volume of the cells evolved similarly in time for both conditions, proving that contractility per se has no effect on the cell volume or cell growth, in the absence of spreading.

      Graphs showing average relative volume of control (N=5, n=354) and Y-27 (N=3, n=292) treated cells plated on PLL-PEG (left); distributions of initial volume for control (middle) and Y-27 treated cells (right) represented on the left graph.

      Taken together, these results show that inhibition of contractility per se does not significantly affect cell volume. They thus support our interpretation of the cell spreading results: reducing contractility affects cell volume specifically in the context of cell spreading, primarily because it changes the spreading speed.

      B. Theoretical interpretation

      In accordance with our experiments, in our model, the effect of contractility is implicitly included in the model because it modulates the spreading dynamics, which is an input to the model, i.e. through the parameters tau_a and A_0.

      We do not include the effect of contractility directly in the water transport equation because our quantitative estimates indicate that the contribution of the hydrostatic pressure to the volume (or the volume change) is negligible compared with the osmotic pressure, even for small variations near the steady-state volume. The key point is that the concentration of ions inside the cell is actually much lower than outside the cell [10,11]. The difference is about 100 mM and corresponds mostly to small, nonionic trapped osmolytes, such as metabolites [12]. The corresponding osmotic pressure is about 10^5 Pa. Taking the cortical tension to be of the order of 1 mN/m and the cell size to be about ten microns, we get a hydrostatic pressure difference of about 100 Pa due to cortical tension. A significant change in cell volume, of the order observed during cell spreading (e.g. a ten percent decrease), will increase the osmotic pressure of the trapped nonionic osmolytes by about 10^4 Pa (their number in the cell remaining identical). For this osmotic pressure to be balanced by an increase in the hydrostatic pressure, the cortical tension would need to increase by a factor of 100, which we consider unrealistic. Therefore, we find it reasonable to ignore the contribution of the hydrostatic pressure difference in the water flux equation. This is also consistent with the new experiments presented above, which show that any change in cell volume caused by inhibiting cortical contractility is below what our measurements can detect (thus likely at most in the 1% range). This is now explained in the main text and Supplementary file.
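      The order-of-magnitude estimate above can be reproduced with the ideal (van 't Hoff) osmotic pressure Π = cRT and the Laplace law ΔP = 2γ/r for a spherical cortex. The Python sketch below assumes T = 310 K, an ideal dilute solution and a cell radius of 10 µm (reading "cell size" as the radius; a 5 µm radius would double the Laplace pressure). Under these assumptions the numbers fall in the same orders of magnitude as quoted: a few 10^5 Pa for 100 mM of trapped osmolytes, a few 10^4 Pa for their increase after a 10% volume loss, and a few hundred Pa of Laplace pressure, i.e. a ratio above 100.

```python
R_GAS = 8.314  # gas constant, J/(mol K)
T = 310.0      # K, assumed physiological temperature (37 °C)


def van_t_hoff_pressure(conc_mM, temperature=T):
    """Ideal osmotic pressure (Pa): Pi = c R T, with 1 mM = 1 mol/m^3."""
    return conc_mM * R_GAS * temperature


def laplace_pressure(tension, radius):
    """Pressure jump (Pa) across a spherical cortex: 2 * tension (N/m) / radius (m)."""
    return 2.0 * tension / radius


pi_trapped = van_t_hoff_pressure(100.0)         # 100 mM of trapped nonionic osmolytes
d_pi = pi_trapped * (1.0 / 0.9 - 1.0)           # rise after a 10 % volume loss, osmolyte number fixed
p_laplace = laplace_pressure(1e-3, 10e-6)       # 1 mN/m cortical tension, 10 um radius (assumed)

print(f"trapped osmolytes: {pi_trapped:.1e} Pa")
print(f"increase after 10 % volume loss: {d_pi:.1e} Pa")
print(f"Laplace pressure: {p_laplace:.0f} Pa, ratio {d_pi / p_laplace:.0f}")
```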

      Regarding the minimal model required to define cell volume, the reason why we believe a one-solute model is not sufficient is fundamentally the same as above: the concentration of trapped osmolytes is comparable to the total osmolarity, which means that their contribution to the total osmotic pressure cannot be discarded. Secondly, within the simplest one-solute model, the pump and leak dynamics fixes the inner osmolyte concentration but does not involve the actual cell size. The most natural term that depends on size is the Laplace pressure (inversely proportional to the cell radius in a spherical cell model). But, as discussed above, this term can only sustain osmotic pressure differences of the order of 100 Pa, corresponding to an osmolyte concentration difference of the order of 0.1 mM. That is only a tiny fraction of the external medium osmolarity, which is about 300 mM. Such a model could thus only work with extremely fine tuning of the pump and leak rates, to within about 1%. Furthermore, such a model could not explain finite volume changes upon osmotic shocks without invoking huge (100-fold) variations in cell surface tension, as discussed above. For these reasons, we believe that the one-solute model is not appropriate to describe our experiments, and that a trapped population of nonionic osmolytes is needed to balance the osmolarity difference created by the solute pump and leak.
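      The conversion from a Laplace-scale pressure to an equivalent osmolyte concentration difference follows from inverting Π = cRT. A minimal Python sketch, again assuming an ideal solution at T = 310 K and a 300 mM external medium; for 100 to 200 Pa it gives a few hundredths to under a tenth of a millimolar, i.e. of the order of 10^-4 of the external osmolarity.

```python
R_GAS = 8.314  # gas constant, J/(mol K)
T = 310.0      # K, assumed physiological temperature (37 °C)


def equivalent_concentration_mM(pressure_pa, temperature=T):
    """Ideal-solution concentration difference (mM) whose osmotic pressure equals pressure_pa.

    c = Pi / (R T) is in mol/m^3, which is numerically equal to mM.
    """
    return pressure_pa / (R_GAS * temperature)


for laplace_pa in (100.0, 200.0):
    dc = equivalent_concentration_mM(laplace_pa)
    print(f"{laplace_pa:.0f} Pa -> {dc:.3f} mM ({dc / 300.0:.1e} of 300 mM)")
```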

      In the revised version of the manuscript, we have now added a section in Supplementary file and in the main text, explaining in more detail this approximation.

      3) The authors considered the role of Na, K, and Cl in the model, and used pharmacological inhibitors of NHE exchanger. I think this part of the experiments and model are somewhat weak. I am not sure the conclusions drawn are robust. First there are many ion channels/pumps in regulating Na, K and Cl. The most important of which is NaK exchanger. NHE also involves H, and this is not in the model. The ion flux expressions in the model are also problematic. The authors correctly includes voltage and concentration dependences, but used a constant active term S_i in SM eq. 3 for active pumping. I am not sure this is correct. Ion pump fluxes have been studied and proposed expressions based on experimental data exist. A study of Na, K, Cl dynamics, and membrane voltage on cell volume dynamics was published in Yellen et al, Biophys. J. 2018. In that paper, they used different expressions based on previously proposed flux expressions. It might be correct that in small concentration differences, their expressions can be linearized or approximated to achieve similar expressions as here. But this point should be considered more carefully.

      We thank the reviewer for this comment. Indeed, we had not properly justified our use of the NHE inhibitor EIPA. Our aim was not to directly affect the major ion pumps involved in volume regulation (which would rather be the Na+/K+-ATPase), because that would likely strongly impact the initial volume of the cell and not only the volume response to spreading, making the interpretation more difficult. We based our choice on previous publications, e.g. [13], showing that EIPA inhibited the main fast volume changes previously reported for cultured cells: it was shown to inhibit volume loss in spreading cells, as well as mitotic cell swelling [14,15]. Using EIPA, we also found that, while the initial volume was only slightly affected, the volume loss was completely abolished even in fast-spreading cells (Y-27 and EIPA combined treatment, Figure S5H). This clearly shows that the volume loss can be abolished without changing the speed of spreading, which was our main aim with this experiment.

      The most direct effect of inhibiting NHE is to change the cell pH [16,17], which, given the low number of protons in the cell (a negligible contribution to the cell's osmotic pressure), cannot affect cell volume directly. Proton transport can, however, affect cell volume indirectly, either through the effect of pH on ion transporters or through the coupling between NHE and the HCO3-/Cl- exchanger. The latter case is well studied in the literature [18]. In brief, the efflux of protons through NHE, driven by the Na+ gradient, leads to an efflux of HCO3- and an influx of Cl-. The change in Cl- concentration affects the osmolarity and hence cell volume.

      We thus performed hyperosmotic shocks with this drug and found that, as expected, it had no effect on the immediate volume change (Ponder’s relation), but affected the rate of volume recovery (combined with cell growth). Overall, the cells treated with EIPA showed a faster volume increase, which is what is expected if the active pumping rate is reduced. This contrasts with the above-mentioned mechanism of volume regulation, which would lead to a reduced volume recovery in EIPA-treated cells. We therefore conclude that NHE perturbation potentially has another effect: changing the pH has a large impact on the functioning of many other processes and, in particular, can affect ion transport [16].

      On the model side, the referee correctly points out that many ion transporters known to play a role in volume regulation are not included in Eq. 3. In the revised manuscript, we now start with a more general ion transport equation. We show that the main equation (Eq. 1, or Supplementary file Eq. 13) relating volume change to tension is not affected by this generalization. This is because we consider only the linear relation between small changes in volume and tension. We note that the generic description of the PLM (Supplementary file Eqs. 1-6) is general and does not require the pump and channel rates to be constant; both \Lambda_i and S_i can be functions of potential and ion concentration, as well as of membrane tension. It is only later in the analysis that we assume these parameters depend only on tension. This point is now made clear in the Supplementary file.

      There is a huge body of work both theoretical and experimental in which the effect of different ion transporters on cell volume is analyzed. The aim of this work is not to provide an analysis of cell volume and the effect of various co-transporters but is rather limited to understanding the coupling between cell spreading, surface tension and cell volume.

      To analytically estimate the sign of the mechano-osmotic coupling parameter alpha, we use a minimal model in which the pumps and channels are taken to be constant. As this is again a perturbative expansion around the steady-state concentration, electric potential and volume, the expression of alpha can easily be computed for a model with more general ion transporters. This generalization would come at the cost of additional parameters in the expression for alpha. We decided to keep the simpler transport model, since the goal of this estimate is merely to show that the sign of alpha is not a given and depends on the relative values of parameters. Even for the simple model we present, the sign of alpha could be changed by varying parameters within reasonable ranges.

      Given these points, and the clarification of the reasons to use EIPA in our experiments, a full mechanistic explanation of the effect of this drug is beyond the scope of this work. Because of this we are not analyzing the effect of EIPA on the model parameter alpha in detail. We now clarified our interpretation of these results in the main text of the article.

      Reviewer #2:

      The work by Venkova et al. addresses the role of plasma membrane tension in cell volume regulation. The authors study how different processes that exert mechanical stress on cells affect cell volume regulation, including cell spreading, cell confinement and osmotic shock experiments. They use live cell imaging, FXm (cell volume) and AFM measurements and perform a comparative approach using different cell lines. As a key result the authors find that volume regulation is associated with cell spreading rate rather than absolute spreading area. Pharmacological assays further identified Arp2/3 and NHE1 as molecular regulators of volume loss during cell spreading. The authors present a modified mechano-osmotic pump and leak model (PLM) based on the assumption of a mechanosensitive regulation of ion flux that controls cell volume.

      This work presents interesting data and theoretical modelling that contribute new insight into the mechanisms of cell volume regulation.

      We thank the referee for the nice comments on our work. We really appreciate the effort (s)he made to help us improve our article, including the careful inspection of the figures. We think our work is much improved thanks to his/her input.

      Reviewer #3:

      The study by Venkova and co-workers examines the coupling between cell volume and the osmotic balance of the cell. Of course, a lot of work has already been done on this subject, but the main specific contribution of this work is to study the fast dynamics of volume changes after several types of perturbations (osmotic shocks, cell spreading, and cell compression). The combination of volume dynamics at very high time resolution and the robust fits obtained from an adapted Pump and Leak Model (PLM) makes the article a step forward in our understanding of how cell volume is regulated during cell deformations. The authors clearly show that:

      - The rate at which a cell deforms directly impacts the volume change.

      - Below a certain deformation rate (either by cell spreading or external compression), the cells adapt fast enough not to change their volume. The plot of dV/dt vs dA/dt shows a clear proportionality relation.

      - The theoretical description of volume change dynamics with the extended PLM makes the overall conclusions very solid.

      Overall, the paper is very well written and contains an impressive amount of quantitative data comparing several cell types and physiological and artificial conditions.

      We thank the referee for the positive comment on our work.

      My main concern about this study is related to the role of membrane tension. In the PLM model, the coupling of cell osmosis to cell deformation is made through the membrane-tension-dependent activity of ion channels. While the role of ion channels is extensively tested, this testing yields some surprising results. Moreover, tension is measured only at fixed time points, and the comparison to theoretical predictions is not always as convincing as expected: when comparing Fig. 6I and 6J, I see that the predictions show that EIPA (+ or - Y27), CK-666 (+ or - Y27) and Y27 alone should have lower tension than in the control conditions, and this is clearly not the case in Fig. 6J. But I would not want to place too much emphasis on those discrepancies, as the drugs in the real case must have broad effects that may not be directly comparable to the theory.

      We apologize for the mislabeling of Figure 6I (now Figure 5I). This plot shows the theoretical estimate for the difference in tension (in units of homeostatic tension) between the case when the cell loses volume upon spreading (as observed in experiments) and the hypothetical situation when the cell does not lose volume upon spreading (alpha = 0). A positive value of the tension difference predicts that cell tension would have been higher if the cell were not losing volume upon spreading, which is the case for the treatments with EIPA and CK-666 (+ Y27) and corresponds to what we found experimentally.

      This prediction thus matches our experimental observations for drug treatments that reduce or abolish volume loss during spreading, which correspond to a higher tether force only at short times.

      We have corrected the figure and figure legend and explained it better in the text.

      But I wonder if the authors would have a better time showing that the dynamics of tension are as predicted by theory in the first place, as comparing theoretical predictions with experiments using drugs with pleiotropic effects may be hazardous.

      Actually, a recent publication (https://doi.org/10.1101/2021.01.22.427801) shows that tension follows volume changes during osmotic shocks, and finds overall the same volume change dynamics as in this manuscript. I am thus wondering whether the authors could use the same technique as described in that paper (FLIM of the Flipper probe) to study the dynamics of tension in their system, or at least refer to this paper to support their claim that tension is the coupling factor between volume and deformation.

      As suggested by the referee, we tried to use the FLIPPER probe. We first tried to reproduce osmotic shock experiments by adding 4% PEG400 (+~200 mOsm) or 50% H2O (-~170 mOsm) to HeLa cells and measuring the average probe lifetime before and after the shock. We found a significantly lower probe lifetime under the hyperosmotic condition compared with control, and a non-significant but slightly higher lifetime after hypoosmotic shock. The magnitude of the lifetime changes was comparable to that in the study cited by the reviewer, but the quality of our measurements did not allow better resolution. Next, we measured the average lifetime for control and CK-666+Y-27-treated cells 30 min and 3 h after plating, because we obtained the highest tether force values for CK-666+Y-27 at 30 min. We did not see a change in lifetime in control cells between 30 min and 3 h (which we also did not see with tether pulling). Cells treated with CK-666+Y-27 showed slightly lower lifetime values than control cells at both 30 min and 3 h after plating, which means that this difference did not correspond to the transient effect of fast spreading but more likely to an effect of the drugs on the measurement.

      Graph showing FLIPPER lifetime before and after osmotic shock for HeLa cells plated on a PLL-coated substrate. Left: control (N=3, n=119) and hyperosmotic shock (N=3, n=115); Right: control (N=3, n=101) and hypoosmotic shock (N=3, n=80). p-values were obtained by t-test.

      Graph showing FLIPPER lifetime for control just after plating on PLL-coated glass (the same control data shown in the previous graph), 30 min (control: N=3, n=88; Y-27+CK-666: N=3, n=130) and 3 h (control: N=3, n=78; Y-27+CK-666: N=3, n=142) after plating on fibronectin-coated glass. p-values were obtained by t-test.
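      The captions above state only that p-values were obtained by t-test. As a hedged illustration of what such a two-group comparison involves, the sketch below computes Welch's unequal-variance t statistic and its Welch-Satterthwaite degrees of freedom from per-cell lifetimes. Whether Welch's or Student's form was used is not stated, the function name `welch_t` is hypothetical, and the lifetime values are placeholders, not data from this study.

      ```python
      from math import sqrt
      from statistics import mean, variance


      def welch_t(a, b):
          """Welch's t statistic and Welch-Satterthwaite degrees of freedom for two samples."""
          se2_a = variance(a) / len(a)  # squared standard error of the mean of a
          se2_b = variance(b) / len(b)  # squared standard error of the mean of b
          t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
          df = (se2_a + se2_b) ** 2 / (
              se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
          )
          return t, df


      # Hypothetical per-cell lifetimes in ns (placeholders, not data from this study)
      control = [5.1, 5.3, 5.0, 5.2, 5.4]
      hyperosmotic = [4.8, 4.9, 4.7, 5.0, 4.8]
      t, df = welch_t(control, hyperosmotic)
      print(f"t = {t:.2f}, df = {df:.1f}")
      ```

      Turning t and df into a p-value requires the cumulative distribution of Student's t (for example `scipy.stats.t.sf`), which the Python standard library does not provide.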

      Because cell-to-cell variability might mask the trend of single-cell changes in lifetime during spreading, we also tried to follow the lifetime of individual cells every 5 min during spreading. Most illuminated cells did not spread, while cells in non-illuminated fields of view spread well, suggesting that even with one image every 5 minutes and the lowest possible illumination, the imaging was too toxic to follow cell spreading over time. We could obtain measurements for a few cells, which did not show any particular trend, but their spreading was not normal. So we cannot really conclude much from these experiments.

      Graph showing FLIPPER lifetime changes for 3 individual cells plated on fibronectin-coated glass (shown in blue, magenta and green) and average lifetime of cells from non-illuminated field (cyan, n=7)

      Our conclusions are the following:

      1) We are able to visualize some change in the lifetime of the probe in osmotic shock experiments, similar to the published results, but with rather large cell-to-cell variability.

      2) The spreading experiments comparing 30 minutes and 3 hours, in control or drug-treated cells, did not reproduce the results we observed with tether pulling; instead, the drugs had a global effect on the measurements at both 30 min and 3 hours.

      3) Following single cells in time led to too much toxicity and prevented normal spreading.

      We think that this technology, which is still in its early development, especially in terms of the microscope setup required (which our institute does not have, so we had to use a platform at another institute with limited experimental time), cannot be implemented within the frame of this revision to provide reliable results. We thus consider these experiments a direction for further development of the work, outside the scope of this study. It would be very interesting to study in detail the comparison between the older, more established method of tether pulling and the novel method of the FLIPPER probe, during cell spreading and in other contexts. To our knowledge, this has never been done so far, so it is not within the frame of this study that we can do it. It is not clear from the literature that the two methods would measure the same thing in all conditions, even if they might match in some.

    1. Author Response

      Reviewer #2 (Public Review):

      In this manuscript, the authors performed single-cell RNA sequencing (scRNA-seq) analysis on bone marrow CD34+ cells from young and old healthy donors to understand the age-dependent cellular and molecular alterations during human hematopoiesis. Using a logistic regression classifier trained on young healthy donors, they identified cell-type composition changes in old donors, including an expansion of hematopoietic stem cells (HSCs) and a reduction of committed lymphoid and myeloid lineages. They also identified cell-type-specific molecular alterations between young and old donors and age-associated changes in differentiation trajectories and gene regulatory networks (GRNs). Furthermore, by comparing the single-cell atlas of normal hematopoiesis with that of myelodysplastic syndrome (MDS), they characterized cellular and molecular perturbations affecting normal hematopoiesis in MDS.

      The present manuscript provides a valuable single-cell transcriptomic resource to understand normal hematopoiesis in humans and the age-dependent cellular and molecular alterations. However, their main claims are not well supported by the data presented. All results are based on computational predictions and were not experimentally validated.

      Major points:

      1) The authors constructed a regularized logistic regression trained on young donors with manually annotated cell types and predicted cell type labels of cells from old and MDS samples. As the manual annotation of cell types was implicitly assumed as ground truth in this manuscript, I'm wondering whether the predicted cell types in old and MDS samples are consistent with the manual annotation. They should apply the same strategy used in young samples for manual annotation to old and MDS samples, and evaluate how accurate their classifier is.

      We performed manual annotation for each MDS sample independently, and for the integrated dataset of the 3 healthy elderly donors. To do so, we performed unsupervised clustering with Seurat and annotated the clusters using the same set of canonical marker genes that we used for the young data. We then analyzed the correspondence between the annotated clusters and the GLMnet predictions. Results are shown in Figure 1a. We observe that the biggest disagreements between methods occur between adjacent identities, such as HSC and LMPP, GMP and GMP with a more prominent granulocyte profile, or MEP, early and late erythroid. When we explore these disagreements along the erythroid branch, we see that they occur particularly close to the border between subpopulations (Figure 1b). This is consistent with the continuous nature of differentiation and the difficulty of establishing boundaries between cell compartments. However, we observe that mislabeling between different hematopoietic lineages is rare.

      In addition, unsupervised clustering was not always able to directly separate the data into the expected subpopulations. We can see different clusters containing the same cell types (e.g. LMPP1, LMPP2), as well as individual clusters containing cells with different identities (e.g. pDC and monocyte progenitors). This is usually due to sources of variability in the data other than cell identity. Additional supervised fine-tuning by local subclustering and merging would be needed to correct for this. In contrast, we believe that our GLMnet-based method focuses on gene expression related to identity, resulting in a classification that is better suited for our purpose.

      Figure 1. Comparison between GLMnet predictions and manually annotated clusters. A) Heatmaps showing percentages of cells in manually annotated clusters (columns) that have been assigned to each of the cell identities predicted by our GLMnet classification method (rows). The analysis was performed independently for the elderly integrated dataset and for every MDS sample. B) UMAP plots showing disagreements in classification between adjacent cell compartments in the erythroid branch. Cells from one erythroid cluster per patient are colored by the identity assigned by the GLMnet classifier. Cells in gray are neither in the highlighted cluster nor labeled as MEP, erythroid early or erythroid late by our classifier.
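      The heatmaps in Figure 1A report, for each manually annotated cluster (column), the percentage of its cells assigned to each GLMnet-predicted identity (row), i.e. a column-normalized contingency table. A minimal standard-library sketch of that computation follows; the function name `column_percentages` and the per-cell cluster and label values are hypothetical examples, not the study's annotations.

      ```python
      from collections import Counter, defaultdict


      def column_percentages(clusters, predictions):
          """For each annotated cluster, percent of its cells assigned to each predicted label."""
          counts = defaultdict(Counter)
          for cluster, label in zip(clusters, predictions):
              counts[cluster][label] += 1
          table = {}
          for cluster, label_counts in counts.items():
              total = sum(label_counts.values())  # cells in this cluster (column total)
              table[cluster] = {label: 100 * n / total for label, n in label_counts.items()}
          return table


      # Hypothetical per-cell assignments (placeholders, not data from this study)
      clusters = ["LMPP1", "LMPP1", "LMPP1", "LMPP1", "MEP", "MEP"]
      predicted = ["LMPP", "LMPP", "LMPP", "HSC", "MEP", "Erythroid early"]
      table = column_percentages(clusters, predicted)
      print(table["LMPP1"])  # {'LMPP': 75.0, 'HSC': 25.0}
      ```

      Normalizing within each cluster, rather than over all cells, keeps small clusters readable on the same color scale as large ones.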

      2) The cell-type composition changes in Figures 1 and 4 were descriptively presented without providing the statistical significance of the changes. In addition, the age-dependent cell-type composition changes should be validated by flow cytometry.

      We thank the reviewer for the comment. The significance of the changes is included in Supplementary File 3. In addition, we have included in the manuscript, as Figure 1-figure supplement 3, the percentages of several cell types that we validated by flow cytometry, namely HSCs, GMPs and MEPs, in young and elderly healthy individuals. Consistent with our bioinformatic analyses, flow cytometry data demonstrated a significant increase in the percentage of HSCs, as well as an increasing trend in MEPs and a slight decrease in the percentage of GMPs in elderly individuals, corroborating our previous results.

      3) In Figure 2, the authors used two different pseudotime inference methods, STREAM and Palantir. It is not clear why they used two different methods for trajectory inference. Do they provide the same differentiation trajectories? How robust are the results of the trajectory inference algorithms? It seems inconsistent that the pseudotime inferred by STREAM was not used for downstream analysis and that a new pseudotime was calculated using Palantir.

      We thank the reviewer for the comment. The reason for using two different methods to perform similar analyses is that each of them provides specific outputs that can be used to perform a more robust and comprehensive analysis. STREAM unravels the differentiation trajectories in a single-cell dataset with an unsupervised approach. In addition, the visualization provided by STREAM (Figure 2C and 2D) allows the reader to interpret the results easily. On the other hand, Palantir provides a more robust analysis to dissect how gene expression dynamics interact and change along differentiation trajectories. For this reason, we decided to use this second method to investigate how specific genes were altered in the monocytic compartment.

      As this is a resource article, showcasing different methods can be valuable, as it provides examples of how each tool can be used to obtain specific results, which can help readers decide which tool is best for their specific case.

      To confirm that the pseudotime results are similar, we performed a correlation analysis on the pseudotime values obtained from each method. We observed a correlation coefficient of 0.78 (p < 2.2e-16), confirming the similarity between both tools.

      Figure 2. Correlation analysis of pseudotime values obtained with STREAM and PALANTIR.
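      The correlation coefficient of 0.78 is reported without naming the correlation method. A minimal sketch, assuming a Pearson correlation computed on per-cell pseudotime values matched by cell barcode (both are assumptions), is shown below; the function names and values are hypothetical. A rank-based Spearman correlation would be an equally reasonable choice if the two tools scale pseudotime differently.

      ```python
      from math import sqrt
      from statistics import mean


      def pearson_r(x, y):
          """Pearson correlation coefficient of two equal-length sequences."""
          mx, my = mean(x), mean(y)
          sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
          sxx = sum((a - mx) ** 2 for a in x)
          syy = sum((b - my) ** 2 for b in y)
          return sxy / sqrt(sxx * syy)


      def pseudotime_agreement(pt_a, pt_b):
          """Correlate two pseudotime maps (cell barcode -> value) on their shared cells."""
          shared = sorted(pt_a.keys() & pt_b.keys())  # align cells by barcode
          x = [pt_a[c] for c in shared]
          y = [pt_b[c] for c in shared]
          r = pearson_r(x, y)
          n = len(shared)
          # t statistic with n - 2 degrees of freedom (undefined when |r| == 1)
          t = r * sqrt((n - 2) / (1 - r ** 2))
          return r, t, n


      # Hypothetical pseudotime values (placeholders, not data from this study)
      stream_pt = {"cellA": 0.10, "cellB": 0.35, "cellC": 0.50, "cellD": 0.80, "cellE": 0.95}
      palantir_pt = {"cellA": 0.05, "cellB": 0.40, "cellC": 0.45, "cellD": 0.70, "cellF": 0.20}
      r, t, n = pseudotime_agreement(stream_pt, palantir_pt)
      print(f"r = {r:.3f} over {n} shared cells")
      ```

      Only cells present in both outputs are compared, so cells dropped by either tool do not distort the coefficient.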

      4) In Figure 2D, some HSCs seem to be committed to the erythroid lineage. The authors should carefully examine whether these HSCs are genuinely HSCs, not early erythroid progenitors.

      We thank the reviewer for the comment. We have performed an in-depth analysis of the classification of HSCs (see Figure 3). Our analyses reveal that none of the cells classified as HSCs express early erythroid progenitor markers. We have also used STREAM to show the expression of these markers along the obtained trajectory and observed that erythroid markers are expressed in the erythroid trajectory but not in the HSC compartment (Figure 4).

      Figure 3. Expression of marker genes in the HSC compartment. Dot plot depicting the normalized scaled expression of canonical marker genes by HSCs of the 5 young and 3 elderly healthy donors. Marker genes are colored by the cell population they characterize. Dot color represents expression levels, and dot size represents the percentage of cells that express a gene.

      Figure 4. Expression of erythroid markers in STREAM trajectories. Expression of GATA1 and HBB (erythroid markers) in the predicted differentiation trajectories.

      5) It is not clear how the authors draw a conclusion from Figure 3D that the number of common targets between transcription factors is reduced. Some quantifications should be provided.

      We thank the reviewer for the comment. We have updated the manuscript to better reflect our findings and to emphasize that the predicted regulatory network of HSCs in elderly donors appears as an independent network compared to that of young donors (Page 6, line 36).

      “Overall, we observed that the predicted regulatory network of elderly HSCs (Figure 3d) appeared as an independent network compared to the young GRN. This finding could result in the loss of co-regulatory mechanisms in the elderly donors.”

      6) The constructed GRNs and related descriptions were based solely on the SCENIC analysis. By providing the results of an orthogonal prediction method for GRNs, the authors should evaluate how robust and consistent their predictions are.

      We thank the reviewer for the comment regarding the method used to build gene regulatory networks. As a resource article, our manuscript describes a complete workflow to perform different aspects of single-cell analyses. These steps range from automated classification to trajectory inference and GRN prediction. All the selected algorithms have already been benchmarked and compared against other tools that perform similar analyses. SCENIC has already been benchmarked against other algorithms (11) and by others (12).

      We do agree with the reviewer that these new predictions could strengthen our findings; however, we believe that these orthogonal predictions would be a better fit if our article were intended for the Research Article category instead of Tools and Resources.

      7) The observed age-dependent cellular and molecular alterations in human hematopoiesis are interesting, but I'm wondering whether the observed alterations are driven by inflammatory microenvironment or intrinsic properties of a subpopulation of HSCs affected by clonal hematopoiesis (CH). To address this, the authors can perform genotyping of transcriptomes (GoT) on old healthy donors with CH. By comparing the transcriptomes of cells with and without CH mutations, we can evaluate the effects of CH on age-associated molecular alterations.

      We thank the reviewer for the comment. Unfortunately, performing GoT (genotyping of transcriptomes) on the healthy donors requires modifying the standard 10x Genomics workflow to amplify the targeted locus and transcript of interest. This would require collecting new samples, optimizing the method and performing a new analysis from scratch (from sequencing to analysis). We believe this is not within the scope of the manuscript. In addition, we do not have enough material to create new single-cell libraries; this would require adding new donors and, as a result, performing a completely new integration analysis.

      Reviewer #3 (Public Review):

      The authors have performed a transcriptional analysis of young/aged hematopoietic stem/progenitor cells which were obtained from normal individuals and those with MDS.

      The authors generated an important and valuable dataset that will be of considerable benefit to the field. However, the data appear to be over-interpreted at times (for example, GSEA analysis does not have "functionality", as the authors claim). On the other hand, a comparison between normal-aged HSC and HSC from MDS patients appears to be under-explored in trying to understand how this disease (which is more common in the elderly) disrupts HSC function.

      A more extensive cross-referencing of other normal HSPC/MDS HSPC datasets from aged humans would have been helpful to highlight the usefulness of the analytical tools that the authors have generated.

      Major points

      1) The authors detail methodology for identification of cell types from single-cell data - GLMnet. This portion of the text needs to be clarified as it is not immediately clear what it is or how it's being used. It also needs to be explained by what metric the classifier "performed better among progenitor cell types" and why this apparent advantage was sufficient to use it for the subsequent analysis. This is critical since interpretation of the data that follows depends on the validation of GLMnet as a reliable tool.

      We thank the reviewer for the comment. We have updated the corresponding section to better describe how GLMnet is used, and to explain that our decision to use GLMnet as our cell type annotation method, instead of other available tools such as Seurat, is based on the results of the benchmark described in Figure 1-figure supplement 1. We also describe the main differences between our method and Seurat (see answer to Reviewer 1, question #4).

      2) The finding of an increased number of erythroid progenitors and decreased number of myeloid cells in aged HSPCs is surprising, since aging is known to be associated with anemia and myeloid bias. Given that the initial validation of GLMnet is insufficiently described, this result raises concerns about the method. Along the same lines, the authors report that their tool detects a reduced frequency of monocyte progenitors. How does this finding correlate with the published data on aging humans? Is monocytopenia a feature of normal aging?

      We thank the reviewer for this comment, as changes in the output of HSCs as a consequence of aging are of high interest. According to the literature, there is clear evidence of the loss of lymphoid progeny with age (13,14), which agrees with our results. However, in the case of the myeloid compartment, the effects of aging are not as clear. Studies in mice have indeed observed that the loss of lymphoid cells is accompanied by increased myeloid output, starting at the level of GMPs (Rossi et al. 2005; Florian et al. 2012; Min et al. 2006). But studies in human individuals have not found changes in the numbers of these myeloid progenitors (Kuranda et al. 2011; Pang et al. 2011). In addition, in these studies, myeloid production was measured exclusively by its white blood cell fraction. More recent studies have focused on the other myeloid compartments: megakaryocyte and erythroid cells. Results point towards an increase of platelet-biased HSCs with age (Sanjuan-Pla et al. 2013; Grover et al. 2016) and a possible expansion of megakaryocytic and erythroid progenitor populations (Yamamoto et al. 2018; Poscablo et al. 2021; Rundberg Nilsson et al. 2016), which may represent a compensatory mechanism for the ineffective differentiation towards this lineage in elderly individuals. This is in line with the accumulation of MEPs we see in our data. Finally, and in accordance with the reduced frequency of monocyte progenitors observed, it has been shown that there is a gradual decline in the monocyte count with increasing age (15).

      Regarding the concerns about our classification method raised by the reviewer, we have performed additional validations that we describe in answers to reviewer 1 comment #4 and reviewer 2 comment #1. To further confirm that the changes in cellular proportions we found are real, we applied two additional classification methods: Seurat transfer and Celltypist (16) to the elderly donors dataset. We obtained a similar expansion in MEPs, together with reduction of monocytic progenitors with the three methods (Figure 5).

      Figure 5. Classification of HSPCs from elderly donors. Barplot showing proportions of every cell subpopulation per elderly donor, resulting from three classification methods: GLMnet-based classifier, Seurat transfer and Celltypist. For the three methods, cells with prediction scores < 0.5 were labeled as "not assigned".
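      The Figure 5 legend states that cells with prediction scores below 0.5 were labeled "not assigned". Assuming the prediction score is the highest per-class probability returned by a classifier (an assumption: how GLMnet, Seurat transfer and Celltypist each define their score is not specified here), the labeling rule and the per-donor proportions shown in the barplot can be sketched as follows. The function names, class names and probabilities are hypothetical.

      ```python
      from collections import Counter

      NOT_ASSIGNED = "not assigned"


      def assign_labels(prob_rows, classes, threshold=0.5):
          """Pick the top-scoring class per cell, or NOT_ASSIGNED if that score is below threshold."""
          labels = []
          for probs in prob_rows:
              best = max(range(len(classes)), key=lambda i: probs[i])
              labels.append(classes[best] if probs[best] >= threshold else NOT_ASSIGNED)
          return labels


      def proportions(labels):
          """Fraction of cells carrying each label (one donor's bar in the barplot)."""
          counts = Counter(labels)
          return {label: n / len(labels) for label, n in counts.items()}


      # Hypothetical per-cell class probabilities (placeholders, not data from this study)
      classes = ["HSC", "MEP", "Monocyte progenitor"]
      probs = [[0.70, 0.20, 0.10], [0.40, 0.35, 0.25], [0.10, 0.50, 0.40]]
      labels = assign_labels(probs, classes)
      print(labels)  # ['HSC', 'not assigned', 'MEP']
      print(proportions(labels))
      ```

      Keeping "not assigned" as an explicit category, rather than discarding those cells, lets the proportions of the three methods be compared on the same denominator.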

      3) The use of terminology requires more clarity in order to better understand what kind of comparison has been performed, i.e. whether global transcriptional profiles are being compared, or those of specific subset populations. Also, the young/aged comparisons are often unclear, i.e. it's not evident whether the authors are referring to genes upregulated in aged HSC and downregulated in young HSC or vice versa. A more consistent data description would make the paper much easier to read.

      We thank the reviewer for this comment. We have updated the manuscript to provide more clarity in the description of the different comparisons made in our analyses. Most changes are located in the Transcriptional profiling of human young and elderly hematopoietic progenitor systems sub-section within the Results.

      4) The link between aging and MDS is not explored but could be an informative use of the data that the authors have generated. For example, anemia is a feature of both aging and MDS whereas neutropenia and thrombocytopenia only occur in MDS. Are there any specific pathways governing myeloid/platelet development that are only affected in MDS?

      Thank you for raising this comment. We believe that discriminating events that take place during healthy aging from those associated with MDS will be helpful to understand this particular disease, as it is so closely related to age. This is why, when analyzing MDS, we considered young and elderly donors as two separate sets of healthy controls, the elderly donors being the most suitable for comparisons with MDS samples.

      With regard to the comment on myeloid and platelet development, the GSEA analysis gives potentially useful information. MYC targets and oxidative phosphorylation are significantly enriched in the MEP compartment from MDS patients when compared to elderly donors, indicating that these progenitors may recover a more active profile with the disease. Hypoxia-related genes, on the other hand, are more active in HSCs and MEPs from healthy elderly donors than in MDS. Hypoxia is known to be implicated in megakaryocyte and erythroid differentiation (17).

      5) MDS is a very heterogeneous disorder and while the authors did specify that they were using samples from MDS with multilineage dysplasia, more clinical details (blood counts, cytogenetics, mutational status) are needed to be able to interpret the data.

      We thank the reviewer for the comment. All the clinical details for each MDS patient are included in Supplementary File 5.

    1. Author Response

      Reviewer #3 (Public Review):

      Dysbiosis has a substantial impact on host physiology. Using the nematode C. elegans and E. coli as a model of host-microbe interactions, Yang et al. defined a mechanism by which the host deals with gut dysbiosis to maintain fitness. They found that E. coli accumulating in the intestine secreted indole, a tryptophan metabolite, which activated the transcription factor DAF-16. DAF-16 induced the expression of lys-7 and lys-8, which in turn limited E. coli proliferation in the gut of worms and maintained the longevity of worms. Finally, these authors demonstrated that indole activated DAF-16 via TRPA-1 in neurons of worms.

      This study revealed a new mechanism of host-microbe interaction. The concept of their work is of broad interest and the results they present are convincing. However, there are some issues that need to be addressed to support the conclusions.

      Major issues

      1) The authors fractionated the crude extract by high-performance liquid chromatography (HPLC). A candidate compound was detected by activity-guided isolation and further identified as indole with mass spectrometry and NMR data. The HPLC fractionation and activity-guided isolation experiments should be described in more detail, with a schematic figure, to reveal how these experiments were performed and how indole was identified. Showing a chemical characterization of indole in Figure 2A is not sufficient for the evaluation of the results. Rather, a figure comparing the 26th fraction with standard indole by MS and NMR would be more compelling.

      We appreciate the concerns of the reviewer. Activity-guided isolation was performed as follows. The crude extract of E. coli supernatant metabolites was divided into 45 fractions according to polarity using an Ultimate 3000 HPLC (Thermofisher, Waltham, MA) coupled with an automated fraction collector. After freeze-drying each fraction, 1 mg of metabolites was dissolved in DMSO for the DAF-16 nuclear localization assay in worms (please see new Supplementary Table S2). The 26th fraction, which showed DAF-16 nuclear translocation-inducing activity, was then separated on a silica gel column (200-300 mesh) with a continuous gradient of decreasing polarity (100%, 70%, 50%, 30%, petroleum ether/acetone) to yield four fractions (26a-d). Only fraction 26b could induce DAF-16 nuclear translocation. This fraction was further separated using a Sephadex LH-20 column to yield 32 fractions. The 26b-11th fraction, which showed DAF-16 nuclear translocation-inducing activity, contained a single compound identified by thin layer chromatography, mass spectrometry and nuclear magnetic resonance (NMR). The compound exhibited a quasi-molecular ion peak at m/z 181.0782 [M+H]+ in positive-mode APCI-MS and was assigned the molecular formula C8H7N. A comparison of its 1H NMR and 13C NMR spectra with data reported in the literature revealed that the compound was indole (Yagudaev, 1986). The figure shows the comparison of the 26b-11 fraction with standard indole by MS (Author response image 1).

      Author response image 1.

      High resolution mass spectrum of the candidate compound and indole.

      2) DAF-16::GFP was mainly located in the cytoplasm of the intestine in worms expressing daf-16p::daf-16::gfp fed live E. coli OP50 on Day 1 (Figure 1A and 1B). The nuclear translocation of DAF-16 in the intestine was increased in worms fed live E. coli OP50 on Days 4 and 7, but not in age-matched WT worms fed heat-killed (HK) E. coli OP50 (Figure 1A and 1B). Since DAF-16 functions downstream of DAF-2, have the levels of DAF-2 been tested during aging on OP50 and (HK) OP50, or with and without indole supplementation?

      In response to the reviewer’s suggestion, we carried out RT-PCR experiments in 4-day-old and 7-day-old worms. It has been shown that DAF-2 initiates a kinase cascade that leads to the phosphorylation and cytoplasmic retention of DAF-16. By contrast, a reduction in DAF-2 signaling leads to the dephosphorylation of DAF-16, allowing its nuclear translocation. We therefore tested the expression of daf-2 in 4-day-old and 7-day-old worms fed live or heat-killed (HK) E. coli OP50. We found that the mRNA levels of daf-2 were significantly increased in worms on days 4 and 7 in the presence of either live or dead E. coli OP50, compared with those in worms on day 1 (Author response image 2A). In addition, supplementation with indole did not alter the mRNA levels of daf-2 in young adult worms (Author response image 2B). Because daf-2 expression changed similarly with live and HK OP50, whereas DAF-16 nuclear translocation increased only in worms fed live OP50, we conclude that the activation of DAF-16 is independent of DAF-2.

      Author response image 2.

      DAF-16 nuclear translocation is independent of DAF-2. (A) The mRNA levels of daf-2 were gradually increased in worms with age. P < 0.01; *P < 0.001; ns, not significant. (B) The mRNA levels of daf-2 were not altered after treatment with indole for 24 hours. ns, not significant.

      3) In lines 155-157, the authors argued that the increase in the levels of indole in worms results from the intestinal accumulation of live E. coli OP50, rather than exogenous indole produced by E. coli OP50 on the NGM plates. However, the work also showed that supplementation with indole (50-200 μM) could significantly increase the indole levels in young adult worms on Day 1 (Figure 2-figure supplement 3B), which could induce nuclear translocation of DAF-16 in worms (Figure 2B). This result suggests that worms could take up indole from the outside culture environment. The concentration of indole in OP50 and (HK) OP50 could be measured.

      We appreciate the concerns of the reviewer. Reviewer #2 also pointed out this problem. In this study, our data showed that the levels of indole were 30.9, 71.9, and 105.9 nmol/g dry weight in worms fed live E. coli OP50 on days 1, 4, and 7, respectively (Figure 2C). This increase in the levels of indole in worms was accompanied by an increase in CFU of live E. coli OP50 in the intestine of worms with age (Figure 2C). In addition, we determined the levels of indole in worms fed HK E. coli OP50 and found that they were 28.2, 31.6, and 36.1 nmol/g dry weight on days 1, 4, and 7, respectively (Figure 2-figure supplement 3A). It should be noted that the levels of indole in worms fed dead E. coli OP50 on day 1 were comparable to those in worms fed live E. coli OP50 on day 1 (30.9 vs 28.2 nmol/g dry weight). However, the levels of indole were not increased in worms fed HK E. coli OP50 on days 4 and 7. Furthermore, the observation that DAF-16 was retained in the cytoplasm of the intestine in worms fed live E. coli OP50 on day 1 (Figure 1A and 1B) also indicated that indole produced by E. coli OP50 on the NGM plates is not enough to induce DAF-16 nuclear translocation. By contrast, supplementation with indole (50-200 μM) significantly increased the indole levels in worms on day 1 (Figure 2-figure supplement 3B), which could induce nuclear translocation of DAF-16 in worms (Figure 2B). Thus, the increase in the levels of indole in worms with age results from the intestinal accumulation of live E. coli OP50, rather than from indole produced by E. coli OP50 on the NGM plates.

      4) Recent work showed that the multicopy DAF-16 transgene acts differently from the single-copy GFP knock-in DAF-16 transgene. Which DAF-16 transgene was used in this work?

      We used strain TJ356, obtained from the Caenorhabditis Genetics Center (CGC). Its genotype has been described as zIs356 [daf-16p::daf-16a/b::GFP + rol-6(su1006)] (Lee, Hench, & Ruvkun, 2001; Lin, Hsin, Libina, & Kenyon, 2001).

      5) In lines 190-193, the authors argued that supplementation with indole (100 μM) inhibited the CFU of E. coli K-12 in WT worms, but not in daf-16(mu86) mutants, on Days 4 and 7 (Figure 3H and 3I), and that these results suggest that endogenous indole is involved in maintaining a normal lifespan in worms. This is an overstatement. The data more likely suggest that indole could inhibit the proliferation of E. coli through DAF-16.

      We really appreciate the reviewer's precision. As suggested, we have changed "...indole is involved in maintaining a normal lifespan in worms" to "...indole produced by bacteria in the gut could inhibit the proliferation of E. coli via DAF-16 in worms".

      6) Sonowal et al. (2017) reported that AHR mediates indole-promoted lifespan extension at 16 °C. Yet this work argued that RNAi knockdown of ahr-1 did not affect the nuclear translocation of DAF-16 in worms fed the E. coli K12 strain on Day 7 (Figure 4-figure supplement 1A) or in young adult worms treated with indole (100 μM) for 24 h. The difference between these two works should be discussed.

      We really appreciate the reviewer's precision. It has been shown that AHR-1 mediates indole-promoted lifespan extension in worms at 16 °C (Sonowal et al., 2017). However, our data show that AHR-1 is not required for indole-induced nuclear translocation of DAF-16 at 20 °C. This indicates that AHR-1-mediated and TRPA-1-mediated lifespan extension by indole are essentially different. In our study, indole was added to NGM plates when worms reached the young adult stage, whereas in the study by Sonowal et al., indole was supplemented at the L1 larval stage. In addition, the lifespan of C. elegans varies with temperature (Xiao et al., 2013). Thus, indole may promote lifespan extension via different mechanisms, depending on exposure time and temperature.

      7) Sonowal et al. (2017) conducted mRNA profiling of worms grown on K12 and K12△tnaA. Is TRPA-1 in their list of deregulated genes? Have other deregulated genes been tested in this work?

      We appreciate the reviewer's question. TRPA-1 is not included in their list of deregulated genes. Sonowal et al. focused on gene expression profiles in worms from L1 larvae to young adults, whereas we focused on worms from young adults to old age. Therefore, we did not test the deregulated genes identified in their work.

      8) How does indole activate TRPA1? In the absence of trpa1, what is the concentration of indole in worms? Since TRPA1 is a channel, is there any possibility that TRPA1 is involved in the transport of indole? It is really interesting and surprising that neuronal TRPA-1, but not intestinal TRPA-1, mediates the beneficial effect of indole. How does indole specifically activate TRPA-1 in neurons to preserve the longevity of worms?

      We appreciate the concerns of the reviewer. TRPA1 is a nonselective cation channel permeable to Ca2+, Na+, and K+ (Zygmunt & Hogestatt, 2014). It is unlikely that TRPA1 is capable of transporting heterocyclic organic compounds, such as indole.

      As suggested, we measured the indole content in trpa-1(ok999) worms. The levels of indole in trpa-1(ok999) worms were slightly higher than those in WT worms on days 4 and 7 (Author response image 3).

      Recently, Ye et al. demonstrated that indole and indole-3-carboxaldehyde (IAld) are agonists of TRPA1, which is conserved in vertebrates (Ye et al., 2021). Thus, it is most likely that indole acts as an agonist of TRPA-1 in C. elegans by binding directly to it. One possibility is that activation of TRPA-1 in neurons by indole induces the release of a neurotransmitter, which in turn triggers a signaling pathway that extends the lifespan of worms by activating DAF-16 in a cell-non-autonomous manner. In contrast, activation of TRPA-1 in the intestine by indole would not lead to the release of such a neurotransmitter. Indeed, TRPA1 induces the release of calcitonin gene-related peptide from perivascular sensory nerves, leading to smooth muscle membrane hyperpolarization and arterial dilation (Talavera et al., 2020). Moreover, activation of TRPA1 by indole and IAld induces the secretion of the neurotransmitter serotonin in zebrafish (Ye et al., 2021).

      Author response image 3.

      The indole levels in trpa-1 mutants are increased on days 4 and 7, compared with those in WT worms. *P < 0.05.

      9) How was neuronal- and intestinal-specific knockdown of trpa-1 by RNAi conducted? And what is the tissue-specific expression pattern of trpa-1? Speculation on how indole is transported into neurons would be appealing.

      We appreciate the reviewer's question. SID-1 is required cell-autonomously for systemic RNAi (Winston, Molodowitch, & Hunter, 2002), so sid-1 mutants are resistant to systemic RNAi. In the neuronal- and intestinal-specific RNAi strains, sid-1 was expressed in the sid-1 mutant background under the control of the neuronal-specific unc-119 and the intestinal-specific vha-6 promoters, respectively. Although TRPA-1 has been reported to be expressed in neurons, muscles, hypodermal cells, and the intestine, Xiao et al. showed that only TRPA-1 expressed in the intestine and neurons contributes to lifespan extension at low temperature (Xiao et al., 2013). The transporter of indole has not been identified. In Arabidopsis, the ATP-binding cassette (ABC) transporter G family member 37 (ABCG37) has been reported to transport a range of indole derivatives (Ruzicka et al., 2010). However, all fifteen C. elegans ABC transporters share less than 30% sequence identity with ABCG37. Thus, it is currently not possible to determine which transporter, if any, carries indole and indole derivatives in C. elegans.

      10) Supplementation with indole only up-regulated the expression of lys-7 and lys-8 in worms subjected to intestinal-specific (Figure 7-figure supplement 2C), but not neuronal-specific, RNAi of trpa-1 (Figure 7-figure supplement 2D). If this is the case, should the addition of indole specifically induce the expression of lys-7p::gfp or lys-8p::gfp in neurons?

      We really appreciate the reviewer's precision. Indeed, lys-7 and lys-8 are expressed in both neurons and the intestine (Author response image 4A and 4C). However, the expression of lys-8p::gfp and lys-7p::gfp in neurons was not altered in worms after treatment with indole or knockdown of trpa-1 by RNAi (Author response image 4B and 4D).

      Author response image 4.

      The expression of LYS-7 and LYS-8 in neurons is not altered after treatment with indole or knockdown of trpa-1 by RNAi. (A and C) Representative images of lys-7p::gfp (A) and lys-8p::gfp (C). Both lys-7 and lys-8 are expressed in neurons and the intestine. (B and D) Quantification of the fluorescence intensity of lys-7p::gfp (B) and lys-8p::gfp (D) in neurons. Data are means ± SD of three independent experiments. ns, not significant.

      11) The authors demonstrated that the K-12△tnaA strain had undetectable tnaA mRNA and indole levels, and that deletion of tnaA significantly inhibited the nuclear translocation of DAF-16 in worms. However, mutations in E. coli may still have nonspecific effects, because transposon insertions or polar mutations can influence downstream genes. The authors should demonstrate that disruption of tnaA alone causes the failure of DAF-16 nuclear translocation.

      As suggested, we rescued the expression of tnaA in the K-12 △tnaA strain. As expected, the indole level in the culture supernatant of the K12 △tnaA::tnaA strain was 34.1 μmol/L, comparable to that of the K12 strain (42.5 μmol/L) (new Figure 2-figure supplement 4D). In addition, DAF-16 nuclear accumulation was increased in worms grown on the K12 △tnaA::tnaA strain on days 4 and 7 (new Figure 2-figure supplement 4E).

    1. Author Response:

      Reviewer #1 (Public Review):

      In their manuscript entitled "PBN-PVT projection modulates negative emotions in mice", Zhu et al. combine circuit mapping techniques with behavioral manipulations to interrogate the function of anatomical projections from the parabrachial nucleus (PBN) to the paraventricular nucleus of the thalamus (PVT). The study addresses an important scientific question, since the PVT, particularly the posterior PVT, is known to be especially sensitive to aversive signals, but the neural circuit mechanisms underlying this process remain unknown. Here the authors contribute important evidence that PBN inputs to the PVT may be critical for this process. Specifically, the authors identify that the PVT receives glutamatergic projections from the PBN that promote aversive behavioral responses but do not modulate nociception. The latter finding is intriguing considering that the PBN is an important node in pain processing and that the PVT has recently emerged as a modulator of pain. Overall, the study includes an impressive array of techniques and manipulations and offers insight into an important scientific question. The authors' conclusions will be significantly strengthened by the inclusion of some additional experiments and controls.

      It is in my view problematic that the authors used different genetic strategies to target the PBN-PVT pathway. For example, in Figure 1 the authors used Vglut2-Cre mice for the anterograde tracing but later in the same figure used constitutively expressed ChR2 in the PBN to assess functional connectivity with the PVT using ex vivo patch-clamp electrophysiology. In Figure 2 the authors once again employed Vglut2-Cre mice to target PBN projections to the PVT and manipulate these projections optogenetically during behavioral tests. However, in the following figure (Fig. 3) the authors use a retro-Cre approach and chemogenetics. The interchangeable use of these different manipulations is not warranted by the data presented. For example, it is unclear whether all PBN neurons projecting to the PVT are glutamatergic and express VGLUT2. When using constitutively expressed ChR2 in the PBN to demonstrate glutamatergic projections to the PVT, the authors may face potential contamination from adjacent brainstem structures such as the LC and DRN, which project to the PVT and are known to contain glutamatergic neurons (vglut1 and vglut3, respectively). As another example, why did the authors not use Vglut2-Cre mice in Figure 4 to inhibit PBN terminals in the PVT, as in Figure 2?

      We agree with the reviewer and have now reframed the manuscript. We first present the slice recording results from wild-type mice (Figure 1), in which we recorded both EPSCs and IPSCs. Light stimulation induced EPSCs in 34 of 52 neurons and IPSCs in 4 of 52 neurons. Please see Page 5 Line 119 to Line 121. We also carefully examined the area infected by the ChR2 virus (Figure R1). There were dense ChR2-mCherry+ neurons in the PBN, and we also observed ChR2-mCherry+ neurons in the nearby ventrolateral periaqueductal gray (VLPAG), locus coeruleus (LC), cuneiform nucleus (CnF), and laterodorsal tegmental nucleus (LDTg), whereas the dorsal raphe nucleus (DR) was not infected. We agree with the reviewer that there could be potential contamination from the LC, which releases dopamine and norepinephrine in the PVT via LC-PVT projections. We have discussed this on Page 13 Line 375 to Line 380.

      Figure R1. AAV-hSyn-ChR2-mCherry virus infection showcase. LPBN, lateral parabrachial nucleus; MPBN, medial parabrachial nucleus; VLPAG, ventrolateral periaqueductal gray; LC, locus coeruleus; CnF, cuneiform nucleus; LDTg, laterodorsal tegmental nucleus; DR, dorsal raphe nucleus; scp, superior cerebellar peduncle. Scale bar, 200 μm.

      We performed tdTomato staining with VgluT2 mRNA in situ hybridization and found that about 94.4% of tdTomato+ neurons express VgluT2 mRNA. These results indicate that the majority of PVT-projecting PBN neurons are glutamatergic. These new results have been included in Figure 1R−U.

      We then used VgluT2-ires-Cre mice for tracing (Figure 1−figure supplement 2) and behavioral tests (optogenetic activation in Figure 2, optogenetic inhibition in Figure 4). We also performed pharmacogenetic activation of PVT-projecting PBN neurons in wild-type mice (Figure 3). Pharmacogenetic activation of the PVT-projecting PBN neurons reduced the center duration in the OFT, similar to the result of optogenetic activation in the OFT, and also induced freezing behaviors. These pharmacogenetic activation experiments support the hypothesis that PBN-PVT projections modulate negative affective states.

      We have now performed optogenetic inhibition of the PBN-PVT projections in VgluT2-ires-Cre mice and found that inhibition of PBN-PVT projections reduces 2-MT-induced aversion-like behaviors and footshock-induced freezing behaviors. These new results have been included in Figure 4 and Figure 4−figure supplement 1 and 2, and are described in the text. Please see Page 9 Line 254 to Page 10 Line 274.

      Related to the previous point, in the retrograde labeling experiment (Fig. 1) it would be useful if the authors determined what fraction of retrogradely labeled cells are indeed VGLUT2+. In behavioral experiments employing the retro-Cre approach, the authors may be manipulating a heterogeneous population of PBN neurons, which could influence their behavioral observations. In general, the authors should ensure that a similar population of PBN-PVT neurons is assessed throughout the study.

      We have now performed tdTomato staining with VgluT2 mRNA in situ hybridization and found that approximately 94.4% of tdTomato+ neurons expressed VgluT2 mRNA. These results indicated that the majority of PVT-projecting PBN neurons are glutamatergic. These new results have been included in Figure 1R−U and were described in the text. Please see Page 5 Line 129 to Line 132.

      The authors' grouping of the behavioral data into the first vs the last four minutes of light stimulation in the OF does not seem to be properly justified and appears rather arbitrary. Also related to data analysis, the unpaired t-test in the fear conditioning experiment in Figure 4J seems inappropriate; ANOVA with group comparisons would be more appropriate here.

      To provide a more detailed profile of the behaviors in the OFT, we further divided the laser ON period (5−10 minutes) into five one-minute periods and analyzed the velocity, non-moving time, travel distance, center time, and jumping. We found that the velocity and non-moving time were increased, and the center time was decreased in the ChR2 mice during most periods. Furthermore, we observed that the travel distance and jumping behaviors were increased only in the first one-minute period in ChR2 mice. These new results have been included in Figure 2−figure supplement 2 and were described in the text. Please see Page 7 Line 179 to Line 189. We also discussed this on Page 14 Line 396 to Line 403.

      We have now performed optogenetic inhibition of PBN-PVT projections during footshock-induced freezing behavior in Vglut2-ires-Cre mice (Figure 4J−K). We also revised the statistics (unpaired Student's t-test) and calculated the percentage of freezing over 10 minutes, which matches the period of constant optogenetic inhibition. Similar changes have been made in Figure 4−figure supplement 3K.

      Considering the persistence of the effect in the OF following optogenetic stimulation of PBN-PVT afferents, the lack of such a persistent effect in the RTPA is hard to reconcile. With additional experiments, the authors attempt to settle this discrepancy by proposing that the PBN-PVT pathway promotes aversion but does not facilitate negative associations. I find this conclusion problematic. If the pathway is critical for conveying aversive signals to the PVT, one would expect that, at the very least, it would be required for the formation of associative memories involving aversive stimuli. However, the authors do not show data to this effect. Instead, they show that animals exhibit decreased acute defensive reactions to aversive stimuli (2-MT and fear conditioning), but do not show whether associative memory related to this experience (e.g., fear memory retrieval) is affected by manipulations of the PBN-PVT pathway.

      We have now performed several experiments to examine the effects of the PBN-PVT projections on aversion formation and memory retrieval.

      We first performed a prolonged conditioned place aversion protocol that mimics drug-induced place aversion and found that optogenetic activation of PBN-PVT projections did not induce aversion in the postconditioning test on Day 4. These new results have been included in Figure 2−figure supplement 2H−I and described in the text. Please see Page 7 Line 196 to Line 199.

      We then performed classical auditory fear conditioning and found that optogenetic inhibition of PBN-PVT projections during footshock in the conditioning period did not affect freezing levels in the contextual test or the cue test (Laser OFF trials). Likewise, inhibition of PBN-PVT projections during the contextual test or cue test (Laser ON trials) did not affect freezing levels. These data suggest that PBN-PVT projections are not crucial for the formation or retrieval of associative fear memory. These new results have been included in Figure 4−figure supplement 2 and described in the text. Please see Page 10 Line 268 to Line 274. We also discussed this on Page 15 Line 430 to Page 16 Line 473.

      A similar lack of connection between aversive signals within the PVT and the PBN pathway is found in the photometry data presented in Figure 5. While the authors' observation of aversive modulation of the pPVT importantly reproduces data from other recent studies, the question here is whether the increased activity of PVT neurons is mediated by input from the PBN. The cFos experiment included in this figure attempts to draw this connection, but direct empirical evidence is required.

      We have now performed a dual Fos staining experiment and an optoelectrode experiment.

      In the dual Fos staining experiment, we found that there was a broad overlap between optogenetic stimulation-activated neurons (expressing the Fos protein) and footshock-activated neurons (expressing the fos mRNA) (Figure 6−figure supplement 1B−E).

      In the optoelectrode experiment, there was also a broad overlap between laser-activated and footshock-activated neurons. This result was consistent with the dual Fos staining result, suggesting that PVTPBN neurons are activated by aversive stimulation. Next, we compared the firing rates of PVT neurons during footshock with and without laser sweeps. Footshock with laser activated 30 of 40 neurons and increased the overall firing rate of the 40 neurons compared with footshock without laser (Figure 6I). These results indicate that activation of PBN-PVT projections can enhance PVT neuronal responses to aversive stimulation.

      These new results have been included in Figure 6, Figure 6−figure supplement 1, and described in the text. Please see Page 10 Line 295 to Page 11 Line 317. We also discussed these results on Page 15 Line 422 to Line 429.

      Reviewer #2 (Public Review):

      Zhu et al. investigated the connectivity and functional role of the projections from the parabrachial nucleus (PBN) to the paraventricular nucleus of the thalamus (PVT). Using neural tracers and in vitro electrophysiological recordings, the authors showed the existence of monosynaptic glutamatergic connections between the PBN and PVT. Further behavioral tests using optogenetic and chemogenetic approaches demonstrated that activation of the PBN-PVT circuit induces aversive and anxiety-like behaviors, whereas optogenetic inhibition of PVT-projecting PBN neurons reduces fear and aversive responses elicited by footshock or the synthetic predator odor 2MT. Next, they characterized the anatomical targets of PVT neurons that receive direct innervation from the PBN (PVTPBN). The authors also showed that PVTPBN neurons are activated by aversive stimuli and that chemogenetically exciting these cells is sufficient to induce anxiety-like behaviors. While the data mostly support their conclusions, alternative interpretations and potential caveats should be addressed in the discussion.

      Strength:

      The authors used different behavioral tests that collectively support a role for PBN-PVT projections in promoting fear- and anxiety-like behaviors, but not nociceptive or depressive-like responses. They also provided insights into the temporal participation of the PBN-PVT circuit by showing that this pathway regulates the expression of affective states without contributing to the formation of fear-associated memories. Because previous studies have shown that activation of projection-defined PVT neurons is sufficient to induce the formation of aversive memories, the differences between the present study and previous findings reinforce the idea of functional heterogeneity within the PVT. The authors further explored this functional heterogeneity in the PVT by using an anterograde viral construct to selectively label PVT neurons that are targeted by PBN inputs. Together, these results connect two important brain regions (i.e., PBN and PVT) that were known to be involved in fear and aversive responses, and provide new information to help the field elucidate the complex networks that control emotional behaviors.

      Weakness:

      The authors should avoid anthropomorphizing the behavioral interpretation of the findings and generalizing their conclusions. In addition, there is a series of potential caveats that could interfere with the interpretation of the results, all of which must be discussed in the article: for example, the long duration of the laser stimulation protocol, the possibility of antidromic effects following photoactivation of PBN terminals in the PVT, and the existence of collateral PBN projections that could also contribute to the observed behavioral changes. Additional clarification about the exclusively glutamatergic nature of the PBN-PVT projection should be provided, and the present findings should be reconciled with prior studies showing the existence of GABAergic PBN-PVT projections.

      We agree with the reviewer and have now revised the text carefully to avoid subjective terms. We show the light-induced EPSC and IPSC results in Figure 1, and we performed RNAscope experiments to clarify the glutamatergic nature of the PVT-projecting PBN neurons (Figure 1 and Figure 1−figure supplement 1). We also added a discussion of the laser stimulation protocol, the possibility of antidromic effects, and collateral projections. Please see Page 14 Line 413 to Page 15 Line 418, and Page 16 Line 449 to Line 457.

      We also added several experiments to dissect the effect of manipulation of the PBN-PVT projection in fear memory acquisition and retrieval. These new results have been included in Figure 4−figure supplement 2 and described in the text. Please see Page 10 Line 268 to Line 274. We also discussed this on Page 15 Line 430 to Page 16 Line 473.

      Reviewer #3 (Public Review):

      Zhu YB et al. investigated the functional role of the projection from the parabrachial nucleus (PBN) to the thalamic paraventricular nucleus (PVT) in processing negative emotions. They found that the PBN sends excitatory projections to the PVT. Activation of the PBN-PVT projection induces anxiety-like and fear-like behaviors, while inhibition of this projection relieves fear and aversion.

      Strengths:

      The authors dissected anatomic and functional connection between the PBN and the PVT by using comprehensive modern neuroscience techniques including viral tracing, electrophysiology, optogenetics and pharmacogenetics. They clearly demonstrated the significant role of PBN-PVT projection in modulating negative emotions.

      Weaknesses:

      The PBN contains a variety of neuronal subtypes that express distinct molecular markers such as CGRP, Tac1, Pdyn, and Nts. The PBN also sends projections to multiple targets, including the VMH, PAG, BNST, CEA, and ILN, that could mediate distinct functions. What is the neuronal identity of PVT-projecting PBN neurons? How are the PVT projection and the other projections organized: are they overlapping or relatively independent pathways? These important questions were not examined in this study, which makes it hard to relate this finding to the existing literature.

      We have now performed RNAscope experiments detecting VgluT2, Tac1, Tacr1, and Pdyn mRNA, and fluorescent immunostaining detecting CGRP protein in the PBN. About 94.4% of tdTomato+ neurons expressed VgluT2 mRNA, whereas tdTomato+ neurons were only partially co-labeled with Tacr1, Tac1, or Pdyn mRNA and were not co-labeled with CGRP. These results indicate that the majority of PVT-projecting PBN neurons are glutamatergic. These new results have been included in Figure 1 and Figure 1−figure supplement 1, and are described in the text. Please see Page 5 Line 129 to Line 140.

      We have also characterized the collateral projections of PVT-projecting PBN neurons (Figure 1−figure supplement 3; Page 6 Line 148 to Line 151) and discussed them on Page 16 Line 449 to Line 457.

    1. Author Response

      Reviewer #1 (Public Review):

      “A sample size of 3 idiopathic individuals seems underpowered relative to the many types of genetic changes that can occur in ASD. Since the authors carried out WGS, it would be useful to know what potential causative variants were found in these 3 individuals and, even if they do not overlap, whether they might be expected to lie in a similar biological pathway.

      If the authors randomly selected 3 more idiopathic cell lines from individuals with autism, would these cell lines also have altered mTOR signaling? And could a line have the same cell biology defects without a change in mTOR signaling? The authors argue that the sample size could be the reason for the lack of overlap in the proteomic changes (unlike the phosphoproteomic overlaps), which makes the overlapping cell biology findings even more remarkable. Or is the phenotyping simply too crude to know whether the phenotypes truly are the same?”

      We appreciate these thoughtful comments and agree that, across several models, our studies indicate the possibility of mTOR alteration in multiple forms of ASD. As above, we are currently pursuing this hypothesis with newly acquired DOD support. With regard to the I-ASD population, we agree that a large variety of genetic changes can occur in genetically undefined ASDs. Indeed, this is precisely why we expected to see “personalized” phenotypes in each I-ASD individual when we embarked on this study. At that time, several years ago, we had planned to expand the analyses to more I-ASD individuals to assess for additional personalized phenotypes. However, as our studies progressed, we were surprised to find convergence in our I-ASD population in terms of neurite outgrowth and migration, and later proteomic results showing convergence on mTOR. We found it particularly remarkable that this convergence was observed despite a sample size of 3. When we had the opportunity to extend our studies to the 16p11.2 deletion population, we were thrilled to conduct the first comparison between I-ASD and a genetically defined ASD, and the scope of the paper therefore turned towards this comparison. We agree that analyses of additional I-ASD individuals would be a worthwhile endeavor, both to understand how pervasive NPC migration and neurite deficits are in autism and to assess the presence of mTOR dysregulation. Furthermore, it would be important to determine whether alterations in other pathways could also lead to similar cell biological deficits, although other studies of neurodevelopmental disorders have found such cellular dysregulation without reporting concurrent mTOR dysregulation. These analyses will be extended under our current grant funding, but such experiments are not feasible within the scope of this manuscript.

      Regarding the phenotyping methods, we chose to assess neurite outgrowth and migration because both are cytoskeleton-dependent processes that are critical for neurodevelopment and are often regulated by the same genes. Furthermore, similar analyses have been applied to NPCs from Fragile X syndrome, 22q11.2 deletion syndrome, and schizophrenia (Shcheglovitov A. et al., 2013; Mor-Shaked H. et al., 2016; Urbach A. et al., 2010; Kelley D. J. et al., 2008; Doers M. E. et al., 2014; Brennand K. et al., 2015; Lee I. S. et al., 2015; Marchetto M. C. et al., 2011). As such, it seems that multiple underlying etiologies can lead to similar dysregulated cellular phenotypes that can contribute to a variety of neurodevelopmental disorders. On a more global level, a developing neuron can undergo only a few different cellular functions, including proliferation, survival, migration, and differentiation. Thus, to understand neurodevelopmental disorders, it is important to study these more “crude” or “global” cellular functions to determine whether they are disrupted in disorders such as ASD. In our studies we find that many of these basic developmental processes are indeed dysregulated, indicating that the typical steps that produce normal brain cytoarchitecture may be disrupted in ASD. To understand why, we then used molecular studies to “zoom” in on potential mechanisms, which implicated common dysregulation of mTOR signaling as one driver of these common cellular phenotypes.

      As suggested, we completed WGS on all the I-ASD individuals and, as mentioned in our manuscript, did not find any overlapping genetic variants among the three I-ASD individuals. The genetic data were published as part of a larger manuscript (Zhou A. et al., 2023). However, each I-ASD individual carried unique variants that were not seen in their unaffected family members, and it is possible that these variants contribute to the I-ASD phenotypes. We also used IPA to conduct pathway analysis on the WGS data, following the same approach we used to analyze the p-proteome and proteome data. From the WGS data, we selected high read-quality variants that were found only in I-ASD individuals and had a functional impact on protein (i.e., excluding synonymous variants). The enriched pathways obtained from these data were strikingly different from those we found in the p-proteome analysis and are now included in supplemental Figure 6 of the manuscript. Briefly, the top 5 enriched pathways were O-linked glycosylation, MHC class I signaling, interleukin signaling, antigen presentation, and regulation of transcription.

      Reviewer #2 (Public Review):

      1) At times, I found it difficult to interpret how differential EF sensitivity connects to the rest of the story. First, it is unclear why these extracellular factors were chosen. They are seemingly different in nature (a neuropeptide, a growth factor, and a neuromodulator) and target largely different pathways. This limits the interpretation of the ASD subtype-specific rescue results. One way of reframing that could help is to describe these as pro-migratory factors, rather than broadly defined EFs, that fail to promote migration in I-ASD lines because of a shared malfunction of the intracellular migration machinery or of cell-cell interactions (possibly through tight junction signaling, Fig S2A). Yet this does not explain the migration/neurite phenotypes in 16p11 lines, where EF sensitivity is not altered, overall implying that divergent EF sensitivity is independent of the underlying mTOR state. What is the proposed model that connects all three findings (divergent EF sensitivity based on ASD subtype, two mTOR classes, convergent cellular phenotypes)?

      We thank you for the kind assessment of our manuscript and for the thought-provoking questions posed. For our study, we defined an extracellular factor (EF) as any growth factor, amino acid, neurotransmitter, or neuropeptide found in the extracellular environment of the developing cells. The EFs used were selected for their well-established roles in regulating early neurodevelopmental phenotypes, their expression during the “critical window” of mid-fetal development (as determined by the Allen Brain Atlas), and, in the case of 5-HT, its association with ASD (Abdulamir H. A. et al., 2018; Adamsen D. et al., 2014; Bonnin A. et al., 2011; Bonnin A. et al., 2007; Chen X. et al., 2015; El Marroun H. et al., 2014; Hammock E. et al., 2012; Yang C. J. et al., 2014; Dicicco-Bloom E. et al., 1998; Lu N. et al., 1998; Suh J. et al., 2001; Watanabe J. et al., 2016; Gilmore J. H. et al., 2003; Maisonpierre P. C. et al., 1990; Dincel N. et al., 2013; Levi-Montalcini R., 1987). Lastly, prior experiments in our lab with a mouse model of neurodevelopmental disorders had shown atypical responses to EFs (IGF-1, FGF, PACAP). As such, when we first chose to use EFs in human NPCs, we wanted to know 1) whether human NPCs responded to these EFs at all, 2) whether EFs regulated neurite outgrowth and migration, and 3) whether there would be a differential response in NPCs derived from individuals with ASD. Our studies were initiated on the I-ASD cohort and, given the heterogeneity of ASD, we had hypothesized that we would observe “personalized” neurite and migration phenotypes. For this reason, we also wanted to select multiple types of EFs that act on different signaling pathways. Ultimately, instead of personalized phenotypes, we found that none of the I-ASD NPCs responded to any of the EFs tested, whereas the 16p11.2 deletion NPCs did; this was therefore the only difference we found between these two “forms” of ASD.
As noted, in I-ASD the lack of response to EFs can be ameliorated by modulating mTOR. However, in the 16p11.2 deletion NPCs, despite mTOR dysregulation similar to that seen in I-ASD, there is no EF impairment. We do not have a cohesive model to explain why the 16pDel individuals differ from the I-ASD individuals, other than to point to the p-proteomes, which do show that the 16pDel NPCs are distinct from the I-ASD NPCs. It seems that mTOR alteration can contribute to impaired EF responsiveness in some NPCs, but perhaps an additional defect must be present for this impairment to manifest, or 16p11.2 deletion NPCs may have specific compensatory features. For example, as noted in the thoughtful comment, the p-proteome canonical pathway analysis shows tight junction malfunction in I-ASD that is not present in the 16pDel NPCs, and it could be the combination of mTOR dysregulation and dysregulated tight junction signaling that has led to the lack of response to EFs in I-ASD. Regardless, we do not think the differences between two genetically distinct ASDs diminish the convergent mTOR results we have uncovered. That is, whatever defects are present in the ASD NPCs, we are able to rescue them with mTOR modulation, which has fascinating implications for the treatment and conceptualization of ASD. Lastly, we see our EF studies as an important inclusion, as they show that in some subtypes of ASD, lack of response to appropriate EFs could be contributing to neurodevelopmental abnormalities. Moreover, lack of response to these EFs could have implications for the treatment of individuals with ASD (for example, SSRIs are commonly used to treat co-morbid conditions in ASD, but if an individual is unresponsive to 5-HT, perhaps this treatment is less effective). We have edited the manuscript to include an additional discussion section addressing the EFs more thoroughly and have added a few sentences to the introduction as well!

      2) A similar bidirectional migration phenotype has been described in hiPSC-derived human cortical interneurons generated from individuals with Timothy Syndrome (Birey et al 2022, Cell Stem Cell). There, the authors show that intracellular calcium influx that is either excessive (in Timothy Syndrome) or pharmacologically dampened (in controls) results in similar migration phenotypes. The authors may consider referring to this report in support of the idea that bimodal perturbations of cardinal signaling pathways can converge upon common cellular migration deficits.

      We thank you for pointing out the similar migration phenotype in the Timothy Syndrome paper and have now cited it in our manuscript. We have also expanded on the concept of “too much or too little” of a particular signaling mechanism leading to common outcomes.

      3) Given that the authors have access to 8 I-ASD hiPSC lines, it would be very informative to assay the mTOR state (e.g. pS6 westerns) in NPCs derived from all 8 lines instead of the 3 presented, even without assessing any additional cellular phenotypes, which the authors have shown to be robust and consistent. This would help readers get a better sense of the proportion of high-mTOR vs low-mTOR classes in a larger cohort.

      We have already addressed this in response to reviewer 1 and the essential revisions section, providing our reasoning for not expanding the study to all 8 I-ASD individuals.

      4) Does mTOR modulation also rescue EF-specific migration responses (Figure 7)?

      We did not conduct sufficient replicates of the rescue of EF-specific migration responses due to the time-consuming and resource-intensive nature of the neurosphere experiments. Unlike the neurite experiments, the neurosphere experiments require significantly more cells, more time, selection of neurospheres based on a size criterion, and then manual trace measurements. We did conduct one experiment in Family-1, in which we used MK-2206 to abolish the response of Sib NPCs to PACAP. Likewise, adding SC-79 to I-ASD-1 neurospheres allowed them to respond to PACAP.

      Author response image 1.

      Author response image 2.

      Reviewer #3 (Public Review):

      We appreciate the kind, detailed and very thorough review you provided for us!

      The results on the mTOR signaling pathway as a point of convergence in these particular ASD subtypes are interesting, but the discussion should address that this has been demonstrated for other autism syndromes, and the present manuscript should recognize that other signaling pathways are also implicated as common factors between the ASD subtypes.

      With regard to the mTOR pathway, we had discussed the other ASD syndromes in which mTOR dysregulation has been observed, including tuberous sclerosis, Cowden Syndrome, NF-1, as well as Fragile-X, Angelman, Rett and Phelan-McDermid, in the final paragraph of the discussion section “mTOR Signaling as a Point of Convergence in ASD”. We have now expanded our discussion to note that other signaling pathways, such as MAPK, cyclins, WNT, and reelin, have also been implicated as common factors between the ASD subtypes.

      The conclusions of this paper are mostly well supported by data, but for the cell migration assay, it is not clear if the authors control for initial differences in the inner cell mass area of the neurospheres in control vs ASD samples, which would affect the measurement of migration.

      Thank you for this thoughtful comment! When we first started collecting our migration data, inner cell mass size was indeed a major concern, and we controlled for it in our methods. First, when plating the neurospheres, we only collected spheres when the majority had a diameter of approximately 100 µm. Very large spheres often could not be imaged because they were out of focus, and very small spheres often dispersed when plated. Thus, these constraints limited the variability of inner cell mass size.

      Furthermore, when we initially collected data, we conducted a proof-of-principle test to see whether initial inner cell mass area (henceforth referred to as initial sphere size, or ISS) influenced the migration data. To do so, we obtained migration and ISS data from each diagnosis (Sib, NIH, I-ASD, 16pASD). We then used RStudio to test for a relationship between Migration and ISS within each diagnosis category using the model lm(Migration ~ ISS, data = bydiagnosis). In this expression, lm denotes a linear model, the tilde (~) specifies Migration as the response and ISS as the predictor, and data = bydiagnosis supplies the data organized by diagnosis.
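      For readers less familiar with R, the per-diagnosis fit of lm(Migration ~ ISS) can be sketched in Python as below. This is an illustrative sketch, not the analysis code behind Author response table 1: the function names and toy numbers are hypothetical. p-values are omitted because they require a t-distribution, which the Python standard library does not provide; R's summary() of an lm fit reports them.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (intercept, slope, r_squared); r_squared is the share of the
    variance in y explained by x.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return intercept, slope, 1.0 - ss_res / ss_tot


def fit_by_diagnosis(records):
    """Fit Migration ~ ISS separately within each diagnosis group.

    records: iterable of (diagnosis, iss, migration) tuples.
    """
    groups = defaultdict(lambda: ([], []))
    for diagnosis, iss, migration in records:
        iss_values, migration_values = groups[diagnosis]
        iss_values.append(iss)
        migration_values.append(migration)
    return {d: fit_line(x, y) for d, (x, y) in groups.items()}


if __name__ == "__main__":
    # Hypothetical toy values, not measured data.
    toy = [("Sib", 1, 3), ("Sib", 2, 5), ("Sib", 3, 7),
           ("I-ASD", 1, 2), ("I-ASD", 2, 4), ("I-ASD", 3, 3)]
    for diagnosis, (b0, b1, r2) in fit_by_diagnosis(toy).items():
        print(diagnosis, round(b1, 3), round(r2, 3))
```

      Fitting each group separately, rather than pooling all diagnoses, mirrors organizing the data by diagnosis so that an ISS effect in one group cannot be masked by differences between groups.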

      The results were expressed as R-squared values, indicating how much of the variance in Migration is explained by ISS for each diagnosis, together with p-values indicating the statistical significance of each relationship. As shown in Author response table 1, there is minimal correlation between Migration and ISS in each data set. Moreover, there are no statistically significant relationships between Migration and ISS, indicating that initial sphere size does not appreciably influence the migration data in any of our data sets.

      Author response table 1.

      Lastly, using R, we modeled the predicted migration for Sib, NIH, I-ASD, and 16pASD when ISS is accounted for in each group. Raw migration data were then plotted against the predicted data, as in Author response image 3.

      Author response image 3.

      As shown in the graph, there are no statistically significant differences between the raw migration data (the data that we actually measured in the dish) and the modeled data in which ISS is accounted for as a variable. As such, we chose not to normalize to or account for ISS in our other experiments. We have now also included the above RStudio analyses in our supplemental figures (Figure S1).

      Also, in Fig 5 and 6, panels I and J omit the effects of the drug on mTOR phosphorylation that are shown for other conditions.

      Both SC-79 and MK-2206 were selected for our experiments after thorough analysis of their effects on human epithelial cells and other cultured cells (citations in manuscript). However, we initially did not know whether either drug would modulate the mTOR pathway in human NPCs; thus, in Figures 5A, 5D, 6A and 6D we chose to focus on two of our data sets to establish the effect of these drugs in human NPCs. Our experiments in Family-1 and Family-2 showed that SC-79 increases pS6 in human NPCs while MK-2206 downregulates it. Once this was established, we expected the drugs to have similar effects in the NPCs from the other families. Thus, we only conducted a proof-of-principle test to confirm that the drugs do indeed have the intended effect in I-ASD-3 and 16pDel. We have included these proof-of-principle westerns in Figures 5I, 5K, 6I and 6K to show that the effects of these drugs are reproducible across all our NPC lines. We did not include quantification, since these data come from a single proof-of-principle western.

    1. Author Response

      Reviewer #1 (Public Review):

      Zhu et al. found that human participants could plan routes almost optimally in virtual mazes with varying complexity. They further used eye movements as a window to reveal the cognitive computations that may underlie such close-to-optimal performance. Participants’ eye movement patterns included: (1) Gazes were attracted to the most task-relevant transitions (effectively the bottleneck transitions) as well as to the goal, with the share of the former increasing with maze complexity; (2) Backward sweeps (gazes moving from goal to start) and forward sweeps (gazes from start to goal) respectively dominated the pre-movement and movement periods, especially in more complex mazes. The authors explained the first pattern as the consequence of efficient strategies of information collection (i.e., active sensing) and connected the second pattern to neural replays that relate to planning.

      The authors have provided a comprehensive analysis of the eye movement patterns associated with efficient navigation and route planning, which offers novel insights for the area through both their findings and methodology. Overall, the technical quality of the study is high. The "toggling" analysis, the characterization of forward and backward sweeps, and the modeling of observers with different gaze strategies are beautiful. The writing of the manuscript is also elegant.

      I do not see any weaknesses that cannot be addressed by extended data analysis or modeling. The following are two major concerns that I hope could be addressed.

      We thank the reviewer for their positive assessment of our work!

      First, the current eye movement analysis does not seem to have touched the core of planning: evaluating alternative trajectories to the goal. Instead, planning-focused analyses such as forward and backward sweeps were all about the actually executed trajectory. What might participants’ eye movements tell us about their evaluation of alternative trajectories?

      This is an important point that we previously overlooked because our experimental design did not incorporate mutually exclusive alternative trajectories. Nonetheless, there are many trials in which participants had access to several possible trajectories to the goal. Some of those alternatives may be trivially suboptimal (e.g., a highly convoluted trajectory, a slightly curved instead of straight trajectory, or setting out on the wrong path and then turning back). Using two simple constraints described in the Methods (no cyclic paths and a limited amount of overlap between alternatives), we algorithmically identified, on each trial, the number of non-trivial alternative trajectories (or options) that were comparable in length to the chosen trajectory (within about 1 standard deviation). A few examples are shown below for the reviewer.
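      One way to apply these two constraints is sketched below. It assumes, hypothetically, that the arena is a 4-connected square grid of open cells, that "about 1 standard deviation" is passed in as a fixed step tolerance (length_tol), and that overlap is the fraction of a candidate's cells shared with the chosen path or with an already-accepted alternative (max_overlap). The actual arena geometry, tolerance, and threshold are those given in the Methods; all names here are illustrative.

```python
def simple_paths(open_cells, start, goal, max_len):
    """All cycle-free paths from start to goal with at most max_len steps.

    open_cells: set of (row, col) cells on a 4-connected square grid.
    Exhaustive depth-first search, so it is only practical for small arenas.
    """
    paths = []

    def dfs(cell, path, visited):
        if len(path) - 1 > max_len:
            return
        if cell == goal:
            paths.append(list(path))
            return
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if nxt in open_cells and nxt not in visited:
                visited.add(nxt)
                path.append(nxt)
                dfs(nxt, path, visited)
                path.pop()
                visited.remove(nxt)

    dfs(start, [start], {start})
    return paths


def overlap(path_a, path_b):
    """Fraction of the cells of path_a that also lie on path_b."""
    return len(set(path_a) & set(path_b)) / len(path_a)


def count_alternatives(open_cells, chosen, length_tol, max_overlap):
    """Count non-trivial alternatives to the chosen path.

    An alternative must be cycle-free, differ in length from the chosen path by
    at most length_tol steps, and overlap the chosen path and every previously
    accepted alternative by at most max_overlap.
    """
    start, goal = chosen[0], chosen[-1]
    chosen_len = len(chosen) - 1
    accepted = []
    for p in simple_paths(open_cells, start, goal, chosen_len + length_tol):
        if abs((len(p) - 1) - chosen_len) > length_tol:
            continue
        if all(overlap(p, q) <= max_overlap for q in [chosen] + accepted):
            accepted.append(p)
    return len(accepted)


if __name__ == "__main__":
    # Toy 3 x 3 arena with the centre cell blocked.
    cells = {(r, c) for r in range(3) for c in range(3)} - {(1, 1)}
    chosen = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)]
    print(count_alternatives(cells, chosen, length_tol=0, max_overlap=0.5))
```

      In the toy arena the two corner-to-corner routes share only their endpoints (an overlap of 2/5), so one alternative is counted with a 0.5 threshold; the chosen path itself is rejected because it overlaps itself completely.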

      The more plausible trajectory options there were, the more time participants spent gazing upon these alternatives during both pre-movement and movement (Figure 4 – figure supplement 1D – left). This is not a trivial effect resulting from the increase in surface area comprising the alternative paths because the time spent looking at the chosen trajectory also increased with the number of alternatives (Figure S8D – middle). Instead, this suggests that participants might be deliberating between comparable options.

      Consistent with this, the likelihood of gazing at alternative trajectories peaked early during pre-movement, well before participants performed sweeping eye movements (Figure 5D). During movement, the probability of gazing upon alternatives increased immediately before participants made a turn, suggesting that certain aspects of deliberation may also be carried out on the fly just before approaching choice points. Critically, during both pre-movement and movement epochs, the fraction of time spent looking at the goal location decreased with the number of alternatives (Figure 4 – figure supplement 1D – right), revealing a potential trade-off between deliberative processing and looking at the reward location. Future studies with more structured arena designs are needed to better understand the factors that lead to the selection of a particular trajectory among alternatives, and we mention this in the discussion (line 445):

      "Value-based decisions are known to involve lengthy deliberation between similar alternatives. Participants exhibited a greater tendency to deliberate between viable alternative trajectories at the expense of looking at the reward location. Likelihood of deliberation was especially high when approaching a turn, suggesting that some aspects of path planning could also be performed on the fly. More structured arena designs with carefully incorporated trajectory options could help shed light on how participants discover a near-optimal path among alternatives. However, we emphasize that deliberative processing accounted for less than one-fifth of the spatial variability in eye movements, such that planning largely involved searching for a viable trajectory."

      Second, the cognitive computations that may underlie the observed patterns of eye movements have not received a thorough theoretical treatment. In particular, to explain why participants tended to fixate the bottleneck transitions, the authors hypothesized active sensing, that is, participants were collecting extra visual information to correct their internal model of the maze. Though active sensing is a possible explanation (as demonstrated by the authors’ modeling of "smart" observers), it is not necessarily the only or the most parsimonious explanation. It is possible that peripheral vision allowed participants to form a good-enough model of the maze and that their eye movements solely reflect planning. In fact, that replays occur more often at bottleneck states is an emergent property of Mattar & Daw’s (2018) normative theory of neural replay. Forward and backward replays are also emergent properties of their theory. It might be possible to explain all the eye movement patterns (fixating the goal and the bottleneck transitions, and the forward and backward replays) based on Mattar & Daw’s theory in the framework of reinforcement learning. Of course, some additional assumptions that specify eye movements and their functional roles in reinforcement learning (e.g., fixating a location is similar to staying at the corresponding state) would be needed, analogous to those in the authors’ "smart" observer models. This unifying explanation may not only be more parsimonious than the authors’ active sensing plus planning account, but also more consistent with the data. After all, if participants had used fixations to correct their internal model of the maze, they should not have shown so little improvement across trials in the same maze.

      We thank the reviewer for this reference. We now note the strong parallels between our eye movement results and that study in the Discussion, and we propose experimental variations that will help crystallize the link. Below is the text that we incorporated into the Discussion section (beginning at line 462).

      "In [a] highly relevant theoretical work, Mattar and Daw proposed that path planning and structure learning are variants of the same operation, namely the spatiotemporal propagation of memory. The authors show that the prioritization of reactivating memories about reward encounters and imminent choices depends upon their utility for future task performance. Through this formulation, the authors provided a normative explanation for the idiosyncrasies of forward and backward replay, the overrepresentation of reward locations and turning points in replayed trajectories, and many other experimental findings in the hippocampus literature. Given the parallels between eye movements and patterns of hippocampal activity, it is conceivable that gaze patterns can be parsimoniously explained as an outcome of such a prioritization scheme. But interpreting eye movements observed in our task in the context of the prioritization theory requires a few assumptions. First, we must assume that traversing a state space using vision yields information that has the same effect on the computation of utility as does information acquired through physical navigation. Second, peripheral vision must allow participants to form a good model of the arena such that there is little need for active sensing. In other words, eye movements would merely reflect memory access and have no computational role. Finally, long-term statistics of sweeps must gradually evolve with exposure, similar to hippocampal replays. These assumptions can be tested in future studies by titrating the precise amount of visual information available to the participants, and by titrating their experience and characterizing gaze over longer exposures. We suspect that a pure prioritization-based account might be sufficient to explain eye movements in relatively uncluttered environments, whereas navigation in complex environments would engage mechanisms involving active inference.
Developing an integrative model that features both prioritized memory access and active sensing to refine the contents of memory would facilitate further understanding of the computations underlying sequential decision-making in the presence of uncertainty."

      In the original manuscript, we referred to active sensing and planning in order to ground our interpretation in terminology established in previous works by other groups, which had investigated them in isolation. Although the role of active sensing could be limited, we are unable to conclude that eye movements solely reflect planning. Even if peripheral vision is sufficient to obtain a good-enough model of the environment, eye movements can further reduce uncertainty about the environment structure, especially in cluttered environments such as the complex arena used in this study. This reduction in uncertainty is not inconsistent with a lack of performance improvement across trials, because the lack of improvement could be explained by a failure to consolidate the information gathered by eye movements and propagate it across trials, an interpretation that would also explain why planning duration is stable across trials (Figure 2 – figure supplement 2B). Furthermore, participants gaze at alternative trajectories more frequently when more options are presented to them. However, we acknowledge that this is a fundamental question; we have identified it as an important topic for follow-up studies and outline experiments to delineate the precise extent to which eye movements reflect prioritized memory access vs active sensing. Briefly, we can reduce the contribution of active sensing by manipulating the amount of visual information, ranging from no information (navigating in the dark) to partial information (foveated rendering in the VR headset). Likewise, we can increase the contribution of memory by lengthening the experiment to ensure that participants become fully familiar with the arena. Yet another manipulation is to use a fixed reward location for all trials, such that experimental conditions would closely match the simulations of the prioritization model. We are excited about performing these follow-up experiments.

      Reviewer #2 (Public Review):

      In this study the authors sought to understand how the patterns of eye movements that occur during navigation relate to the cognitive demands of navigating the current environment. To achieve this, the authors developed a set of mazes with visible layouts that varied in complexity. Participants navigated these environments in immersive virtual reality while seated on a chair.

      The question of how eye movements relate to cognitive demands during navigation is a central and often overlooked aspect of navigating an environment. Studying eye movements in dynamic scenarios in a way that enables systematic analysis is technically challenging, which is why so few studies have tackled this issue.

      The major strengths of this study are the technical development of the setup for studying, recording and analysing eye movements. The analysis is extensive and allows greater insight than most studies exploring eye movements would provide. The manuscript is also well written and argued.

      A current weakness of the manuscript is that several other factors have not been considered that may relate to the eye-movements. More consideration of these would be important.

      We thank the reviewer for their positive assessment of the innovative aspects of this study. We have tried to address the weaknesses by performing additional analyses described below.

      1. In the experimental design it appears possible to separate the length of the optimal path from the complexity of the maze, but this appears not to have been done in this design. It would be useful for the authors to comment on this, as these two parameters seem critically important to the interpretation of the role of eye movements, e.g. a lot of scanning might be required for an obvious but long path, or a lot of scanning might be required to uncover a short path through a complex maze.

      This is a great point. We added a comment to the Discussion at line 489 to address this:

      "Future work could focus on designing more structured arenas to experimentally separate the effects of path length, number of subgoals, and environmental complexity on participants’ eye movement patterns."

      To make the most of our current design, we performed two analyses. First, we regressed trial-specific variables simultaneously against path length and arena complexity. This analysis revealed that the effect of complexity on behavior persists even after accounting for path length differences across arenas (Figure 4 – figure supplement 3). Second, path length is but one of many variables that collectively determine the complexity of the maze. Therefore, we also analyzed the effects of multiple trial-specific variables (number of turns, length of the optimal path, and the degree to which participants are expected to turn back on the initial direction of heading to reach the goal, regardless of arena complexity) on eye movements. This yielded fine-grained insights into which task demands most influenced each eye movement quality that was described. More complex arenas posed, on average, greater challenges in terms of longer and more winding trajectories, such that eye movement qualities which increased with arena complexity also generally increased with specific measures of trial difficulty, albeit to varying degrees. We added additional plots to the main/supplementary figures and described these analyses under a new heading (“Linear mixed effects models”) in the Methods section.
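      As a rough illustration of regressing a trial-specific variable on path length and arena complexity simultaneously, the sketch below fits an ordinary least-squares model with both predictors. It deliberately omits the participant-level random effects of the linear mixed effects models described in the Methods, and the function name and toy values are hypothetical; in practice a dedicated mixed-model package would be used.

```python
def ols(rows, y):
    """Least-squares coefficients for y regressed on the columns of rows.

    rows: list of predictor rows, each starting with 1.0 for the intercept,
    e.g. [1.0, path_length, complexity].
    Solves (X^T X) beta = X^T y by Gauss-Jordan elimination with partial pivoting.
    """
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    aug = [xtx[i] + [xty[i]] for i in range(k)]  # augmented matrix [X^T X | X^T y]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] for i in range(k)]


if __name__ == "__main__":
    # Hypothetical trials: [intercept, path length, complexity level].
    rows = [[1.0, 1, 0], [1.0, 2, 1], [1.0, 3, 0], [1.0, 4, 2], [1.0, 5, 1]]
    y = [3, 8, 7, 15, 14]  # toy trial variable, generated as 1 + 2*length + 3*complexity
    print([round(b, 6) for b in ols(rows, y)])
```

      Because both predictors enter the same model, the complexity coefficient estimates the effect of complexity with path length held fixed, which is the sense in which an effect can "persist after accounting for" path length.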

      2. Similarly, it was not clear how the number of alternative plausible paths was considered in the analysis. It seems possible to have a very complex maze with no actual required choices that would involve a lot of scanning to determine this, or a very simple maze with just two very similar choices that would involve significant scanning to weigh up which was indeed the shortest.

      Thank you for the suggestion. In conjunction with our response to the first comment from Reviewer #1, we used some constraints to identify non-trivial alternative trajectories – trajectories that pass through different locations in the arena but are roughly similar in length (within about 1 SD of the chosen trajectory). In alignment with your intuition, the most complex maze, as well as the completely open arena, did not have non-trivial alternative trajectories. For the three arenas of medium complexity, the more open arenas had more non-trivial alternative trajectories.

      When we analyzed the relative effect of the number of alternative trajectories on eye movements, we found that both possibilities you suggested are true. On trials with many comparable alternatives, participants indeed spent more time scanning the alternatives and less time looking at the goal (Figure S8D). Likewise, in the most complex maze, where there are no alternatives, participants still spent much more time (than in simpler mazes) learning about the arena structure at the expense of looking at the goal (Figure 3E-F). This analysis yielded interesting new insights into how participants solved the task and opens the door for investigating this trade-off in future work. More generally, because both deliberation and structure learning appear to drive eye movements, they must be factored into studies of human planning.

      3. Can the affordances linked to turning biases and momentum explain the error patterns? For example, paths that require turning back on the current trajectory direction to reach the goal will be more likely to cause errors, and may be associated with patterns of eye movements related to such errors.

      Thank you for this question. In conjunction with the trial-specific analyses on the effect of trajectory length (Point #1) on errors and eye movement patterns, we also examined how the number of turns and the relative bearing (angle between the direction of initial heading and the direction of target approach) affected participants’ behavior. Unexpectedly, turns and momentum did not affect the relative error (distance of the stopping location to the target) as much as trajectory length did (Figure 1 – figure supplement 1F). This supports the idea that errors were primarily caused by forgetting the target location, with this memory leak worsening with distance (or time). However, turns do influence eye movements in general. For example, more turns generally result in an increase in the fraction of time that participants spend gazing upon the trajectory (Figure 4 – figure supplement 1A) and sweeping (Figure 4D). Furthermore, the number of turns decreased the fraction of time participants spent gazing at the target during movement (Figure 2D).
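      The two trial descriptors mentioned here, number of turns and relative bearing, could be computed from a trajectory of (x, y) points roughly as follows. This sketch assumes, hypothetically, that the initial heading is the direction of the first segment and the target-approach direction is that of the last segment; the exact definitions used in the analysis may differ, and the function names are illustrative.

```python
import math


def heading(p, q):
    """Direction of the segment from point p to point q, in radians."""
    return math.atan2(q[1] - p[1], q[0] - p[0])


def wrap_degrees(angle):
    """Wrap an angle in degrees into the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def count_turns(trajectory, tol_deg=1e-6):
    """Number of vertices where consecutive segments change direction."""
    turns = 0
    for a, b, c in zip(trajectory, trajectory[1:], trajectory[2:]):
        change = wrap_degrees(math.degrees(heading(b, c) - heading(a, b)))
        if abs(change) > tol_deg:
            turns += 1
    return turns


def relative_bearing(trajectory):
    """Unsigned angle (0-180 degrees) between initial heading and final approach."""
    initial = heading(trajectory[0], trajectory[1])
    final = heading(trajectory[-2], trajectory[-1])
    return abs(wrap_degrees(math.degrees(final - initial)))


if __name__ == "__main__":
    u_turn = [(0, 0), (1, 0), (1, 1), (0, 1)]  # heads east, finishes heading west
    print(count_turns(u_turn), round(relative_bearing(u_turn), 6))
```

      Wrapping angle differences into [-180, 180) avoids spurious large changes when headings cross the ±180° boundary of atan2, so a path that doubles back has a relative bearing near 180° regardless of its orientation.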

      4. Why were half of the obstacle transitions misremembered for the blind agent? This seems a rather arbitrary choice. More information to justify this would be useful.

      We tested different percentages and found qualitatively similar results. The objective was to determine the patterns of eye movements that would be most beneficial when participants have an intermediate level of knowledge about the arena configuration (rather than near-zero or near-perfect), because during most trials participants can use peripheral vision to assess the rough layout, but they do not precisely remember the locations of the obstacles. We added this explanation to Appendix 1, which describes the simulation details and was added in response to a suggestion by another reviewer.
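      A minimal sketch of how a given fraction of obstacle transitions could be misremembered for a simulated agent is shown below. It assumes, for illustration only, that "misremembering" means the agent believes a blocked transition is open; the function name, the transition encoding as pairs of grid cells, and the seeding scheme are hypothetical and not taken from Appendix 1.

```python
import random


def corrupt_obstacle_memory(blocked_transitions, fraction=0.5, seed=None):
    """Split blocked transitions into those the agent remembers and those it forgets.

    blocked_transitions: set of (cell_a, cell_b) pairs that an obstacle blocks.
    fraction: share of blocked transitions the agent misremembers as open.
    Returns (remembered, forgotten) as two disjoint sets.
    """
    rng = random.Random(seed)
    ordered = sorted(blocked_transitions)  # fixed order so a given seed is reproducible
    n_forget = round(fraction * len(ordered))
    forgotten = set(rng.sample(ordered, n_forget))
    remembered = set(ordered) - forgotten
    return remembered, forgotten


if __name__ == "__main__":
    blocked = {((0, 0), (0, 1)), ((0, 1), (0, 2)), ((1, 0), (1, 1)), ((1, 1), (1, 2))}
    remembered, forgotten = corrupt_obstacle_memory(blocked, fraction=0.5, seed=0)
    print(len(remembered), len(forgotten))
```

      Exposing the fraction as a parameter makes it straightforward to repeat the simulation at several percentages, which is how a choice such as one half can be checked for qualitatively similar results.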

      5. Some of the results could usefully be explained in simpler terms at various points to aid readers less familiar with the RL formulation of the task. For example, a key result reported is that participants skew toward looking at the transition function rather than the reward function in complex environments. It would be useful to relate this to everyday scenarios, in this case broadly to looking more at the junctions in the maze than at, or near, the goal when the maze is complex.

      This is a great suggestion. We added an everyday analogy when describing the trade-off on line 258.

      "The trade-off reported here is roughly analogous to the trade-off between looking ahead towards where you’re going and having to pay attention to signposts or traffic lights. One could get away with the former strategy while driving on rural highways whereas city streets would warrant paying attention to many other aspects of the environment to get to the destination."

      6. The authors should comment on their low participant sample size. The sample seems reasonable given the reproducibility of the patterns, but it is much lower than in most comparable virtual navigation tasks.

      Thank you for the recommendation. We had some difficulty recruiting human participants who were willing to wear a headset that had been worn by other participants during the COVID-19 pandemic, and some participants dropped out of the study due to motion sickness. To mitigate the low sample size, we collected data from four more participants and performed analyses to confirm that the major findings can be observed in most individual participants. Participant-specific effects are included in the new plots made in response to Points #1-3, and the number of participants with a significant result for each figure/panel is given in Appendix 2 – table 3.

      Reviewer #3 (Public Review):

      In this article, Zhu and colleagues studied the role of eye movements in planning in complex environments using virtual reality technology. The main findings are that 1) humans can navigate complex environments near-optimally; 2) gaze data revealed that humans tend to look at the goal location in simple environments but spend more time on task-relevant structures in more complex ones; 3) human participants show backward and forward sweeping mostly during planning (pre-movement) and execution (movement), respectively.

      I think this is a very interesting study with a timely question and is relevant to many areas within cognitive neuroscience, notably decision making, navigation. The virtual reality technology is also quite new for studying planning. The manuscript has been written clearly. This study helps with understanding computational principles of planning. I enjoyed reading this work. I have only one major comment about statistical analyses that I hope authors can address.

      We thank the reviewer for the accurate description and positive assessment of our work.

      The number of subjects included in the analyses is only nine. This is a very small sample size for most human studies. What was the motivation behind it? I believe that most findings are quite robust, but 9 subjects still seems too low. Perhaps the authors can replicate their findings in another sample? Alternatively, they might be able to provide statistics per individual and only report those that are significant in all subjects (of course, this only works if the reported effects are very robust, but only in such a case are 9 subjects sufficient).

      Thank you for the suggested alternatives. Due to the pandemic, we had some difficulties recruiting human participants who were willing to wear a headset which had been worn by other participants. We collected data on four more participants and included them in the analyses, and also confirmed that the major findings are observed in most individuals. The number of participants with a significant result for each analysis has been included in Figure 1 – figure supplement 3 and Appendix 2 – table 3.

      Somewhat related to the previous point, it seems to me that the authors have pooled data from all subjects (basically treating them as 1 super-subject?). I am saying this based on the sentence written on page 5, line 130: "Because we are interested in principles that are conserved across subjects, we pooled subjects for all subsequent analyses." If this is not the case, please clarify (and also add a section on "statistical analyses" in Methods). But if this is the case, it is very problematic, because it means that all statistical analyses are based on a fixed-effect approach, which is notorious for inflating type I error.

      Your interpretation is correct, and we acknowledge your concern about pooling participants. We had done this after observing that our results were consistent across participants, but we had not demonstrated this. We have now performed analyses sensitive to participant-specific effects and find that all major results hold for most participants, and we included additional main and supplementary bar plots (and tables in Appendix 2) showing per-participant data. The new plots/tables show the effect of independent variables (mainly trial/arena difficulty) on dependent variables for each participant, as well as general effects conserved across participants. A new paragraph was added to the Methods section to describe the “Linear mixed effects models” which we used.

      Again, quite related to the last two points: please include degrees of freedom for every statistical test (i.e. every reported p-value).

      Degrees of freedom (df) are now included along with each p-value.

    1. Author Response

      Reviewer #1 (Public Review):

      Using fMRI-based univariate and multivariate analyses, Root, Muret, et al. investigated the topography of face representation in the somatosensory cortex of typically developed two-handed individuals and individuals with a congenital and acquired missing hand. They provide clear evidence for an upright face topography in the somatosensory cortex in all three groups. Moreover, they find that one-handers, but not amputees, show shorter distances from lip representations to the hand area, suggesting a remapping of the lips. They also find a shift of the upper face away from the deprived hand area in one-handers, and significantly greater dissimilarity between face part representations in amputees and one-handers. The authors argue that this pattern of remapping is different from that predicted by cortical neighborhood theories and points toward a remapping of face parts that are able to compensate for hand function, e.g., using the lips/mouth to manipulate an object.

      These findings provide interesting insights into the topographic organization of face parts and the principles of cortical (re)organization. The authors use several analytical approaches, including distance measures between hand- and face-part-responsive regions and representational similarity analysis (RSA). Particularly commendable is the rigorous statistical analysis, such as the use of Bayesian comparisons, and careful interpretation of absent group differences.

      We thank the reviewer for their positive and constructive feedback.

      Reviewer #2 (Public Review):

      After amputation, the deafferented limb representation in the somatosensory cortex is activated by stimulation of other body parts. A common belief is that the lower face, including the lips, preferentially "invades" deafferented cortex due to its proximity to cortex. In the present study, this hypothesis is tested by mapping the somatosensory cortex using fMRI as amputees, congenital one-handers, and controls moved their forehead, nose, lips or tongue. First, they found that, unlike its counterpart in monkeys, the representation of the face in the somatosensory cortex is right-side up, with the forehead most medial (and abutting the hand) and the lips most lateral. Second, there was little evidence of "reorganization" of the deafferented cortex in amputees, even when tested with movements across the entire face rather than only the lips. Third, congenital one-handers showed significant reorganization of deafferented cortex, characterized principally by the invasion of the lower face, in contrast to predictions from the hypothesis that proximity was the driving factor. Fourth, there was no relationship between phantom limb pain reports and reorganization.

      As a non-expert in fMRI, I cannot evaluate the methodology. That being said, I am not convinced that the current consensus is that the representation of the face in humans is flipped compared to that of monkeys. Indeed, the overwhelming majority of somatosensory homunculi I have seen for humans have the face right side up. My sense is that the fMRI studies that found an inverted (monkey-like) face representation contradict the consensus.

      Thank you for pointing this out. As we tried to emphasise in the introduction, very few neuroimaging studies have actually investigated face somatotopy in humans, with inconsistent results. We agree that the default consensus tends to be dominated by the upright depiction of Penfield’s homunculus (recently replicated by Roux et al, 2018). However, due to methodological and practical constraints, alignment across subjects is usually difficult to achieve for intracortical recordings, which makes it difficult to assess the consistency of topographical organisation. Moreover, previous imaging studies did not convincingly support Penfield’s homunculus. For these two key reasons, the spatial orientation of the human facial homunculus is still debated. A further limitation is that the vast majority of human studies investigating face (re)mapping focused solely on the lip representation, using the cortical proximity hypothesis to interpret their results. Consequently, as we highlight above in our response to the Editor, there is a widespread and false representation in the human literature of the lips neighbouring the hand area.

      To account for the reviewer’s critique and convey some of this context, we changed our title from: Reassessing face topography in primary somatosensory cortex and remapping following hand loss; to: Complex pattern of facial remapping in somatosensory cortex following congenital but not acquired hand loss. This was done to de-emphasise the novelty of face topography relative to our other findings.

      We also rewrote our introduction (lines 79-94) as follows:

      “The research focus on lip cortical remapping in amputees is based on the assumption that the lips neighbour the hand representation. However, this assumption goes against the classical upright orientation of the face in S1 [26–30], as first depicted in Penfield’s Homunculus and in later intracortical recordings and stimulation studies [26–29], with the upper-face (i.e., forehead) bordering the hand area. In contrast, neuroimaging studies in humans studying face topography provided contradictory evidence for the past 30 years. While a few neuroimaging studies provided partial evidence in support of the traditional upright face organisation [31], other studies supported the inverted (or ‘upside-down’) somatotopic organisation of the face, similar to that of non-human primates [32,33]. Other studies suggested a segmental organisation [34], or even a lack of somatotopic organisation [35–37], whereas some studies provided inconclusive or incomplete results [38–41]. Together, the available evidence does not successfully converge on face topography in humans. In line with the upright organisation originally suggested by Penfield, recent work reported that the shift in the lip representation towards the missing hand in amputees was minimal [42,43], and likely to reside within the face area itself. Surprisingly, there is currently no research that considers the representation of other facial parts, in particular the upper-face (e.g., the forehead), in relation to plasticity or PLP.”

      We also updated the discussion accordingly (lines 457, 469-477, 490-492).

      Similarly, it is not clear to me how the observations (1) of limited reorganization in amputees, (2) of significant reorganization in congenital one-handers, and (3) of the lack of relationship between PLP and reorganization is novel given the previous work by this group. Perhaps the authors could more clearly articulate the novelty of these results compared to their previous findings.

      Thank you for giving us the opportunity to clarify this important point. The novelty of these results can be summarised as follows:

      (1) Conceptually, it is crucial for us to understand if deprivation-triggered plasticity is constrained by the local neighbourhood, because this can give us clues regarding the mechanisms driving the remapping. We provide strong topographic evidence about the face orientation in controls, amputees and one-handers.

      (2) The vast majority of previous research on brain plasticity following hand loss (both congenital and acquired) in humans has exclusively focused on the lower face, and lips in particular. We provide systematic evidence for stable organisation and remapping of the neighbouring upper face, as well as the lower face. We also study topographic representation of the tongue (and nose) for the first time.

      (3) The vast majority of previous research on brain remapping following hand loss (both congenital and acquired, neuroimaging and electrophysiological) focused on univariate activity measures, such as the spatial spread of units showing a similar feature preference, or the average activity level across individual units. We go beyond remapping by using RSA, which allows us to ask not only whether new information is available in the deprived cortex (as well as in the native face area), but also whether this new information is structured consistently across individuals and groups. We show that representational content is enhanced in the deprived cortex of one-handers, whereas it is stable in amputees relative to controls (and to their intact hand region).

      (4) Based on previous studies, the assumption was that reorganisation in congenital one-handers was relatively unspecific, affecting all tested body parts. Here, we provide evidence for a more complex pattern of remapping, with the forehead representation seemingly moving out of the missing hand region (and the nose representation being tentatively similar to controls). That is, we show not just “invasion” but also a shift of the neighbour away from the hand area which has never been documented (or in fact suggested).

      (5) Using Bayesian analyses we provide definitive evidence against a relationship between PLP and forehead remapping, providing first and conclusive evidence against the remapping hypothesis, based on cortical neighbourhood.

      Our inclination is not to add a summary paragraph of these points in our discussion, as it feels too promotional. Instead, we have re-written large sections of the introduction and discussion to better emphasise each of these points separately throughout the text, where the context is most appropriate. Given the public review strategy taken by eLife, the novelty summary provided above will be available for any interested reader, as part of the public review process. However, should the reviewer feel that a novelty summary paragraph is required (or an emphasis on any of the points summarised above), we will be happy to revise the manuscript accordingly.

      Finally, Jon Kaas and colleagues (notably Niraj Jain) have provided evidence in experiments with monkeys that much of the observed reorganization in the somatosensory cortex is inherited from plasticity in the brain stem. Jain did not find an increased propensity for axons to cross the septum between face and hand representations after (simulated) amputation. From this perspective, the relevant proximity would be that of the cuneate and trigeminal nuclei and it would be critical to map out the somatotopic organization of the trigeminal and cuneate nuclei to test hypotheses about the role of proximity in this remapping.

      Thank you for highlighting this very relevant point, which we are well aware of. We fully agree with the reviewer that this is an important goal for future study, but functional imaging of the brainstem in humans is particularly challenging and would require ultra-high-field imaging (7T) and specialised equipment. We have encountered considerable local resistance due to hypothetical MRI safety concerns about scanning amputees at this higher field strength, meaning we are unable to carry out this research ourselves. Our former lab member Sanne Kikkert, who now runs her independent research programme in Zurich, has been working towards this goal for the past 4 years. So we can say with confidence that this aim is well beyond the scope of the current study. In response to your comment, we mentioned this potential mechanism in the introduction (lines 98-101), we ensured that we only referred to “cortical proximity” throughout our manuscript, and we circle back to this important point in the discussion.

      Lines 539-543: “Moreover, even if the remapping we observed here goes against the theory of cortical proximity, it can still arise from representational proximity at the subcortical level, in particular at the brainstem level [44,45]. While challenging in humans, mapping both the cuneate and trigeminal nuclei would be critical to provide a more complete picture regarding the role of proximity in remapping.”

      Reviewer #3 (Public Review):

      In their study, the authors set out to challenge the long-held claim that cortical remapping of hand-deprived territories in the somatosensory cortex follows somatotopic proximity (the hand region gets invaded by cortical neighbors), as classically assumed. In contrast to this claim, the authors suggest that remapping may not follow cortical proximity but instead functional rules as to how the effector is used. Their data indeed suggest that the deprived hand area is not invaded by the forehead, which is the cortical neighbor, but instead by the lips, which may compensate for hand loss in manipulating objects. Interestingly, the authors suggest this is mostly the case for one-handers but not for amputees, in whom the reorganization seems more limited in general (but see my comments below on this last point).

      This is a remarkably ambitious study that has been skilfully executed on a strong number of participants in each group. The complementarity of state-of-the-art uni- and multi-variate analyses are in the service of the research question, and the paper is clearly written. The main contribution of this paper, relative to previous studies including those of the same group, resides in the mapping of multiple face parts all at once in the three groups.

      We are grateful to the reviewer for appreciating the immense effort that this study involved.

      In the winner takes all approach, the authors only include 3 face parts but exclude from the analyses the nose and the thumb. I am not fully convinced by the rationale for not including nose in univariate analyses - because it does not trigger reliable activity - while keeping it for representational similarity analyses. I think it would be better to include the nose in all analyses or demonstrate this condition is indeed "noisy" and then remove it from all the analyses. Indeed, if the activity triggered by nose movement is unreliable, it should also affect multivariate.

      Following this comment, we re-ran all univariate analyses to include the nose, and updated the main text, supplemental results and related figures throughout. In short, adding the nose did not change the univariate results, apart from a now significant group x hemisphere interaction for the CoG of the tongue when comparing amputees and controls, which better matches the trend for greater surface coverage in the deprived hand ROI of amputees. Full details are provided in our response to Reviewer 1 above.

      The rationale for not including the hand is perhaps more convincing, as it seems to induce activity in both controls and amputees but not in one-handers. First, it would be great to visualize this effect, at least as supplemental material, to support the decision. Second, this raises the interesting possibility that the enhanced invasion of hand territory by the lips in one-handers might be linked to whether hand-related activity can be observed in the presupposed hand region in this population. The authors may consider linking these.

      Thank you for this comment. As we explain in our response to Reviewer 1 above, we did not intend the thumb condition in one-handers for analysis, as the task given to one-handers (imagine moving a body part you never had) is inherently different from that given to the other groups (move, or at least attempt to move, your (phantom) hand). As such, we could not pursue the analysis suggested by the reviewer here. To reduce the discrepancy, and following Reviewer 1’s advice, we decided to remove the hand-face dissimilarity analysis, which we had included in our original manuscript and which might have sparked some of this interest. Upon reflection, we agreed that this specific analysis does not directly relate to the question of remapping (but rather to shared representation), in addition to making the paper unbalanced. We will now feature this analysis in another paper where it appears more appropriate, in the context of referred sensations in amputees (Amoruso et al, 2022 MedRxiv).

      The use of the geodesic distance between the center of gravity in the Winner Take All (WTA) maps for each movement and a predefined cortical anchor is clever. However, how the Center Of Gravity (COG) was computed on spatially disparate regions might deserve more explanation.

      We are happy to provide more detail on this analysis, which weights the CoG by cluster size (using the workbench command -metric-weighted-stats). Consider the example shown here (Figure 1) for a single control participant, where each CoG is computed either without weighting (yellow vertices) or with cluster weighting (forehead CoG=red, lip CoG=dark blue, tongue CoG=dark red). When a movement produced a single cluster of activity (the lips in the non-dominant hemisphere, shown in blue), the CoG’s location was identical for the weighted (red) and unweighted (yellow) calculations. Other movements, such as the tongue (green), produced one large cluster (at the lateral end) and a few smaller, more disparate clusters located more medially. In this case, the larger cluster of maximal activity is weighted more heavily than the smaller clusters in the CoG calculation, so the CoG (dark red) is slightly skewed towards it relative to the smaller clusters.

      Figure 1. Centre-of-gravity calculation, weighted and unweighted by cluster size, in an example control participant. Here the winner-takes-all output for each facial movement (forehead=red, lips=blue, tongue=green) was used to calculate the centre-of-gravity (CoG) at the individual level in both the dominant (left-hand side) and non-dominant (right-hand side) hemisphere, weighted by cluster size (forehead CoG=red, lip CoG=dark blue, tongue CoG=dark red), compared to an unweighted calculation (denoted by yellow dots within each movement’s winner-takes-all output).

      This is now explained in the methods (lines 760-765) as follows:

      “To assess possible shifts in facial representations towards the hand area, the centre-of-gravity (CoG) of each face-winner map was calculated in each hemisphere. The CoG was weighted by cluster size meaning that in the event of multiple clusters contributing to the calculation of a single CoG for a face-winner map, the voxels in the larger cluster are overweighted relative to those in the smaller clusters. The geodesic cortical distance between each movement’s CoG and a predefined cortical anchor was computed.”
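      To make the weighting concrete, here is a minimal Python sketch, assuming each face-winner map is given as a list of vertex coordinates on a flattened 2D surface together with a connected-cluster label per vertex. The function names are hypothetical; weighting each vertex by the size of the cluster it belongs to is one plausible reading of "weighted by cluster size", not necessarily what -metric-weighted-stats does internally; and Euclidean distance stands in for the geodesic distance actually used.

```python
import math
from collections import Counter
from typing import Sequence, Tuple

Point = Tuple[float, float]


def unweighted_cog(coords: Sequence[Point]) -> Point:
    """Plain mean position of all winner vertices."""
    n = len(coords)
    return (sum(p[0] for p in coords) / n, sum(p[1] for p in coords) / n)


def weighted_cog(coords: Sequence[Point], cluster_ids: Sequence[int]) -> Point:
    """Mean position where every vertex is weighted by the size of its cluster.

    Hypothetical illustration: vertices in larger clusters count more, so the
    CoG is pulled towards the largest cluster of the face-winner map.
    """
    if not coords or len(coords) != len(cluster_ids):
        raise ValueError("need one cluster id per vertex and at least one vertex")
    sizes = Counter(cluster_ids)
    weights = [sizes[c] for c in cluster_ids]
    total = sum(weights)
    x = sum(w * p[0] for w, p in zip(weights, coords)) / total
    y = sum(w * p[1] for w, p in zip(weights, coords)) / total
    return (x, y)


def distance_to_anchor(cog: Point, anchor: Point) -> float:
    """Euclidean stand-in (assumption) for the geodesic distance on the cortical surface."""
    return math.dist(cog, anchor)
```

      With four vertices in one cluster at x=10 and a single isolated vertex at x=0, the unweighted CoG lies at x=8, while the weighted CoG moves to x≈9.41, towards the larger cluster; for a single cluster both calculations coincide. On a real cortical mesh, the weighted mean would also need to be projected back onto the surface before a geodesic distance can be measured.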

      Moreover, imagine that for some reason the forehead region extends both dorsally and ventrally in a specific population (e.g., amputees): the COG would stay unaffected, but the overlap between hand and forehead would increase. The analyses of the surface area within the hand ROI for the lips and forehead nicely complement the WTA analyses and suggest higher overlap for the lips and lower overlap for the forehead, but none of the maps or graphs presented clearly shows those results. Perhaps the authors could consider adding a figure clearly highlighting that there is indeed more lip activity IN the hand region.

      We agree with you on this limitation of the CoG, and this is why we interpret all cortical distance analyses in tandem with the laterality indices. The laterality indices correspond to the proportion of surface area in the hand region for a given face part in the winner maps.

      Nevertheless, to further convince the Reviewer, we extracted activity levels (beta values) within the hand region of congenitals and controls, and we ran (as for CoGs) a mixed ANOVA with the factors Hemisphere (deprived x intact) and Group (controls x one-handers).

      As expected from the laterality indices obtained for the Lips, we found a significant group x hemisphere interaction (F(1,41)=4.52, p=0.040, ηp²=0.099), arising from enhanced activity in the deprived hand region in one-handers compared to the non-dominant hand region in controls (t(41)=-2.674, p=0.011) and to the intact hand region in one-handers (t(41)=-3.028, p=0.004).

      Since this kind of analysis was the focus of previous studies (from which we are trying to get away) and since it is redundant with the proportion of face-winner surface coverage in the hand region, we decided not to include it in the paper. But we could add it as a Supplementary result if the Reviewer believes this strengthens our interpretation.

      In addition to overlap analyses between hand and other body parts, the authors may also want to consider doing some Jaccard similarity analyses between the maps of the 3 groups to support the idea that amputees are more alike controls than one-handers in their topographic activity, which again does not appear clear from the figures.

      We thank the reviewer for this clever suggestion. We now include a Jaccard similarity analysis, which quantifies the degree of similarity (0=no overlap between maps; 1=fully overlapping) between winner-takes-all maps (which included the nose, as in the revised univariate results) across groups. For each face part and each amputee, the similarity with the 22 controls and with the 21 one-handers was averaged separately. We used a linear mixed model with fixed factors of Group (One-handers x Controls), Movement (Forehead x Nose x Lips x Tongue) and Hemisphere (Intact x Deprived) on Jaccard similarity values (similar to the model used for the RSA analysis). A random effect of participant, as well as age as a covariate, were also included in the model.

      Results showed a significant group x hemisphere interaction (F(240.0)=7.70, p=0.006; controlled for age; Fig. 5), indicating that amputees’ maps showed different similarity values to controls’ and one-handers’ maps depending on the hemisphere. Post-hoc comparisons (corrected alpha=0.025; uncorrected p-values reported) revealed significantly higher similarity to controls’ than to one-handers’ maps in the deprived hemisphere (t(240)=-3.892, p<.001). Amputees’ maps also showed higher similarity to controls’ maps in the deprived relative to the intact hemisphere (t(240)=2.991, p=0.003). Amputees therefore displayed greater similarity of facial somatotopy to controls in the deprived hemisphere, again suggesting less evidence for cortical remapping in amputees.
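      For readers unfamiliar with the measure, the sketch below shows the Jaccard computation in Python, assuming each winner-takes-all map is stored as a dictionary from vertex index to winning face part. The names are hypothetical, returning 0.0 when both maps are empty is our own convention rather than something stated here, and the linear mixed model fitted on top of these values is not shown.

```python
from statistics import mean
from typing import Dict, List, Set


def part_vertices(winner_map: Dict[int, str], part: str) -> Set[int]:
    """Vertices won by a given face part (e.g. "lips") in a winner-takes-all map."""
    return {v for v, p in winner_map.items() if p == part}


def jaccard(a: Set[int], b: Set[int]) -> float:
    """|A ∩ B| / |A ∪ B|: 0 = no overlap between maps, 1 = fully overlapping."""
    union = a | b
    if not union:
        return 0.0  # assumed convention when both maps are empty
    return len(a & b) / len(union)


def mean_similarity_to_group(subject_map: Dict[int, str],
                             group_maps: List[Dict[int, str]],
                             part: str) -> float:
    """Average Jaccard similarity of one participant's map to every map in a reference group."""
    own = part_vertices(subject_map, part)
    return mean(jaccard(own, part_vertices(g, part)) for g in group_maps)
```

      For example, jaccard({1, 2, 3}, {2, 3, 4}) shares two of four vertices and returns 0.5; averaging such values over all controls (or all one-handers) gives one similarity value per amputee, face part and hemisphere for the mixed model.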

      We added these results at the end of the univariate analyses (lines 335-351) and in the discussion (lines 464-465 and 497-500).

      This brings me to another concern I have, related to the claim that the change in cortical organization is mostly observed in one-handers. Most of this conclusion seems to rely on the fact that some effects are observed in one-handers but not in amputees when compared to controls; however, no direct comparisons are made between amputees and one-handers, so we may be making an erroneous inference about an interaction that is not actually tested (Nieuwenhuis et al., 2011). For instance, the shift of the forehead away from the hand/face border is also (mildly) significant in amputees (and is observed more strongly in one-handers), so the conclusion (e.g., from the subtitle of the results section) that it is specific to one-handers might not be fully supported by the data. The same applies to the invasion of the hand territory by the lips, which is significant in amputees in terms of surface area. Altogether, this calls for toning down the idea that plasticity is restricted to congenital deprivation (e.g., the last sentence of the abstract). Even if remapping is numerically stronger in one-handers, if I am not wrong, there are no statistics showing that it is indeed stronger than in amputees, and amputees actually show significant effects relative to controls along the same lines as those shown (more strongly) in one-handers.

      Thank you for this very important comment. We fully agree – the RSA across-groups comparison is highly informative but insufficient to support our claims. We did not compare the groups directly to avoid multiple comparisons (both for statistical reasons and to manage the size of the results section). But the reviewer’s suggestion to perform a Jaccard similarity analysis complements very nicely the univariate and multivariate results and allows for a direct (and statistically lean) comparison between groups, to assess whether amputees are more similar to controls or to congenital one-handers, taking into account all aspects of their maps (both spatial location/CoG and surface coverage). We added the Jaccard analysis to the main text, at the end of the univariate results (lines 335-385). The Jaccard analysis suggests that amputees’ maps in the deprived hemisphere were more similar to the maps of controls than to the ones of congenital one-handers. This allowed us to obtain significant statistical results to support the claim that remapping is indeed stronger in one-handers than in amputees (lines 346-351). We also compared both amputees and one-handers to the control group. In line with our univariate results, this revealed that the only face part for which controls were more similar to one-handers than to amputees was the tongue (lines 379-381). And that the forehead remapping observed at the univariate level in amputees (surface area), is likely to arise from differences in the intact hemisphere (lines 381-383).

      Finally, we also added the post-hoc statistics comparing amputees to congenitals in the RSA analysis (lines 425-427): “While facial information in the deprived hand area was increased in one-handers compared with amputees, this effect did not survive our correction for multiple comparisons (t(70.7)=-2.117, p=0.038).”

      Regarding the univariate results mentioned by the reviewer, we would like to emphasise that we found no significant effect for the lips in amputees, although we agree that their surface area appears intermediate between controls and one-handers. However, this laterality index was not different from zero. This test has now been added (lines 189-190). Regarding the forehead, we fully agree with the Reviewer, and we adjusted the subtitle accordingly (lines 241-242). For consistency, we also added the t-test vs zero for the forehead surface area (non-significant, lines 251-253).

      Also, maybe the authors could explore whether there is actually a link between the number of years without hand and the remapping effects.

      To address this question, we explored our data using a correlation analysis. The only body part that showed some suggestive remapping effects was the tongue, so we explored whether we could find a relationship (Pearson’s correlation) between years since amputation and the laterality index of the Tongue in amputees (r = 0.007, p=0.980, 95% CI [-0.475, 0.475]). We also explored amputees’ global Jaccard similarity values to controls in the deprived hemisphere (r = -0.010, p=0.970, 95% CI [-0.488, 0.473]), and could not find any relationship. Considering there was no strong remapping effect to explain, we find this result too exploratory to include in our manuscript.
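      The correlation and confidence interval reported above can be outlined with the sketch below. The response does not state how the 95% CI was obtained, so the Fisher z-transformation used here is an assumption (it is a standard choice for Pearson's r); the p-value, which requires the t distribution, is omitted to keep the example dependency-free, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Z_95 = 1.959963984540054  # two-sided 95% quantile of the standard normal


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Sample Pearson correlation coefficient."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (>= 2)")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r: float, n: int, z_crit: float = Z_95) -> Tuple[float, float]:
    """Confidence interval for r via Fisher's z = atanh(r), with SE = 1/sqrt(n - 3)."""
    if n < 4:
        raise ValueError("need at least 4 observations")
    if abs(r) >= 1:
        return (r, r)  # degenerate case: atanh is undefined at |r| = 1
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return (math.tanh(z - z_crit * se), math.tanh(z + z_crit * se))
```

      Because the interval is symmetric on the z scale, a correlation near zero yields an interval that is nearly symmetric around zero, and its width is governed mainly by the number of participants.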

      One hypothesis generated by the data is that lips remap in the deprived hand area because lips serve compensatory functions. Actually, also in controls, lips and hands can be used to manipulate objects, in contrast to the forehead. One may thus wonder if the preferential presence of lips in the hand region is not latent even in controls as they both link in functions?

      We agree with the reviewer’s reasoning, and we think that the distributed representational content we recently found in two-handers (Muret et al, 2022) provides a first hint in this direction. It is worth noting that in that previous publication we did not find differences across face parts in the activity levels obtained in the hand region, except for slightly more negative values for the tongue. But we do think that such latent information is likely to provide a “scaffolding” for remapping. While the design of our face task does not allow us to assess information content for each face part (as done for the lips in Muret et al, 2022), this should be further investigated in follow-up studies.

      We added a sentence in the discussion to highlight this interesting notion: Lines 556-559: “Together with the recent evidence that lip information content is already significant in the hand area of two-handed participants (Muret et al, 2022), compensatory behaviour since developmental stages might further uncover (and even potentiate) this underlying latent activity.”

    1. Author Response

      Reviewer #1 (Public Review):

      Using data from extracellular recordings in mouse piriform cortex (PCx) by Bolding & Franks (2018), the authors examined the strength, timing, and coherence of gamma oscillations with respiration in awake mice. During "spontaneous" activity (i.e., without odor or light stimulation), they observed a large peak in gamma that was driven by respiration and aligned with the spiking of FBIs. TeLC, which blocks synaptic output from principal cells onto other principal cells and FBIs, abolishes gamma. Beta oscillations are evoked while gamma oscillations are induced. Odors strongly affect beta in PCx but have minimal effects on gamma (on its duration but not its amplitude). Unlike gamma, strong, odor-evoked beta oscillations are observed in TeLC. Using PCA, the authors found a small subset of neurons that conveyed most of the information about the odor (winner cells). Loser cells were more phase-locked to gamma, which matched the time course of inhibition. Odor decoding accuracy closely follows the time course of gamma power.

      We thank the reviewer for the accurate summary of our work.

      I think this is an interesting study that uses a publicly available dataset to good effect and advances the field elegantly, especially by selectively analyzing activity in identified principal neurons versus inhibitory interneurons, and by making use of defined circuit perturbations to causally test some of their hypotheses.

      We thank the reviewer for the positive appraisal.

      Major:

      • The authors show odor-specificity at the time of the gamma peak and imply that the gamma coupling is important for odor coding. Is this because gamma oscillations are important or because gamma is strongest when activity in PCx is strongest (i.e. both excitatory and inhibitory activity, which would cancel each other in the population PSTH, which peaks earlier)? To make this claim, the authors could show that odor decoding accuracy - with a small (~10 ms sliding window) - oscillates at approx. gamma frequencies. As is, Fig. 5 just shows that cells respond at slightly different times in the sniff cycle. What time window was used for computing the Odor Specificity Index? Put another way, is it meaningful that decoding is most accurate when gamma oscillations are strongest, or is this just a reflection of total population activity, i.e., when activity is greatest there is more gamma power, and odor decoding accuracy is best?

      We thank the reviewer for the critical comment. Please note that the employed decoding strategy (supervised learning with cross-validation) prevents us from quantifying a time series of decoding accuracy. Nevertheless, to overcome this difficulty, we divided the spike data (0-500 ms following the inhalation start) according to the gamma cycle into four non-overlapping gamma phase bins. Then we tested whether odor decoding accuracy varied as a function of the gamma cycle phase. Using this approach, we found that decoding depended on the gamma phase, as shown below:

      (The bottom plot shows the modulation of decoding accuracy within the gamma cycle [Real MI] compared to a surrogate distribution [Surr MI, obtained by circularly shifting the gamma phases by a random amount]).
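      The phase binning and the modulation index described above can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes that each spike-count sample carries an instantaneous gamma phase (e.g., from a Hilbert transform of the gamma-filtered LFP), that the MI is a Tort-style normalized divergence of per-bin decoding accuracy from a flat profile, and that each surrogate shifts the phase time series relative to the spikes by a random non-zero lag. All function names are hypothetical.

```python
import math
import random


def phase_bin(phase, n_bins=4):
    """Map a phase in radians (any range) to one of n_bins equal-width bins."""
    wrapped = phase % (2 * math.pi)  # wrap into [0, 2*pi)
    return min(int(wrapped / (2 * math.pi / n_bins)), n_bins - 1)


def modulation_index(per_bin_values):
    """Tort-style MI: 0 for a flat profile across bins, 1 if all mass is in one bin."""
    total = sum(per_bin_values)
    p = [v / total for v in per_bin_values]
    n = len(p)
    entropy = -sum(q * math.log(q) for q in p if q > 0)
    return (math.log(n) - entropy) / math.log(n)


def circular_shift(seq, lag):
    """Rotate a sequence by `lag` samples with wrap-around (used to build surrogates)."""
    lag %= len(seq)
    return seq[-lag:] + seq[:-lag] if lag else list(seq)


def random_lag(n_samples, seed=None):
    """Draw a random non-zero lag for one surrogate (assumed surrogate scheme)."""
    rng = random.Random(seed)
    return rng.randrange(1, n_samples)
```

Re-running the decoding after each random shift of the phase series, and collecting the resulting MIs, yields a surrogate distribution against which the real MI can be compared.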

      We interpret this new result as indicating that gamma influences decoding accuracy directly and that our previous result was not only a reflection of total population activity. Moreover, please note that we only use the principal cell activity for computing the odor specificity index (Fig 5E) and decoding accuracy (Fig 7B). Both peak at ~150 ms following inhalation start, in a time window in which the net principal cell activity is roughly similar to baseline levels (Fig 5A bottom panel).

      These new panels were added to revised Figure 7 and mentioned in the revised manuscript (page 8); we now also discuss the above considerations about maximal decoding not coinciding with the peak firing rate (page 10).

      Regarding the Odor Specificity Index computation, we apologize for not describing it appropriately in the corresponding Methods subsection. We employed the same sliding time window as in the population vector correlation and the decoding analyses (i.e., 100 ms window, 62.5 % overlap). This information has been added to the revised manuscript (page 15).
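      With a 100 ms window and 62.5 % overlap, consecutive windows start 100 × (1 − 0.625) = 37.5 ms apart. The sketch below enumerates such windows; the 0-500 ms span is an assumption borrowed from the gamma-phase analysis, and `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(t_start, t_stop, width=100.0, overlap=0.625):
    """Return (start, end) pairs in ms for windows of `width` overlapping by `overlap`."""
    step = width * (1.0 - overlap)  # 100 ms * 0.375 = 37.5 ms
    windows = []
    k = 0
    while True:
        start = t_start + k * step  # index by k to avoid accumulating rounding error
        if start + width > t_stop + 1e-9:
            break  # the next window would extend past the analysis span
        windows.append((start, start + width))
        k += 1
    return windows


# Example: windows covering 0-500 ms after inhalation start (assumed span).
windows = sliding_windows(0.0, 500.0)
```

Under these assumptions the span holds 11 windows, the last one covering 375-475 ms.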

      • The authors say, "assembly recruitment would depend on excitatory-excitatory interactions among winner cells occurring simultaneously during gamma activity." Can the authors test this prediction by examining the TeLC recordings, in which excitatory-excitatory connections are abolished?

      We thank the reviewer for the relevant comment. We followed the reviewer's suggestion and analyzed odor assemblies in TeLC recordings. Interestingly, we found a greater increase in the firing rate of winner cells in TeLC recordings (see figure below), which therefore does not support our previous interpretation that assembly recruitment would depend on excitatory-excitatory local interactions.

      Thus, this new result suggests a much more critical role than we previously considered for the OB projections in determining winner neurons.

      Moreover, we found significant differences in the properties of loser cells. In particular, the TeLC-infected piriform cortex showed a decreased number of losing cells, which were significantly less inhibited than their contralateral counterparts:

      Furthermore, the reduced inhibition of losing cells was associated with an increased correlation of assembly weights across odors for the affected hemisphere:

      Therefore, we believe these results highlight the role of gamma oscillations in segregating cell assemblies and generating a sparse orthogonal odor representation in the piriform cortex. These findings are now included as new panels of Figure 6 and discussed on page 8. Notably, to be consistent with them, we modified our speculative sentence (page 9) "assembly recruitment would depend on excitatory-excitatory interactions among winner cells occurring simultaneously during gamma activity" to “(…) the assembly recruitment would depend on OB projections determining which winner cells “escape” gamma inhibition, highlighting the relevance of the OB-PCx interplay for olfaction (Chae et al., 2022; Otazu et al., 2015).”

      • The authors show that gamma oscillations are abolished in the TeLC condition and use this to claim that gamma arises in the PCx. However, PCx neurons also project back to the OB, where they form excitatory connections onto granule cells. Fukunaga et al (2012) showed that granule cells are essential for generating gamma oscillations in the bulb. Can the authors be sure that gamma is generated in the PCx, per se, rather than generated in the bulb by centrifugal inputs from the PCx, and then inherited from the bulb by the PCx?

      We thank the reviewer for the pertinent comment regarding gamma generation in the PCx. To address this point, we have performed current source density (CSD) analysis, which showed sinks and sources of low-gamma oscillations within the PCx, as well as a phase reversal:

      This result – shown as panel F in Figure 1 – suggests a local generation of gamma within the PCx. Along with the fact that PCx gamma tightly correlates with piriform FBI firing and that PCx gamma disappears in the TeLC ipsi hemisphere, which has intact OB projections, we deem it more parsimonious to assume that gamma does originate in the piriform circuit during feedback inhibition acting on principal cells and is not directly inherited from OB (though it depends on its drive). We have edited our text to incorporate the panel above (page 4). We now also relate our results to those of Fukunaga and colleagues on OB gamma generation and discuss the alternative interpretation of inherited gamma (page 9).
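      For readers unfamiliar with CSD, the standard one-dimensional estimate is the negative second spatial derivative of the LFP across recording depth, CSD(z) ≈ −σ [φ(z − h) − 2φ(z) + φ(z + h)] / h², where h is the contact spacing and σ the tissue conductivity. The sketch below assumes equally spaced contacts ordered by depth and a constant σ (set to 1, so only relative values are meaningful); it is not necessarily the exact estimator used in this analysis, and the function name is hypothetical.

```python
def csd_second_difference(potentials, spacing, conductivity=1.0):
    """1D CSD estimate: -sigma * second spatial difference of the LFP.

    potentials: LFP values at one time point, ordered by depth (equal spacing assumed).
    spacing: inter-contact distance.
    Returns len(potentials) - 2 values; the two edge contacts are dropped.
    With this sign convention, negative values are conventionally read as sinks.
    """
    h2 = spacing ** 2
    return [
        -conductivity * (potentials[i - 1] - 2 * potentials[i] + potentials[i + 1]) / h2
        for i in range(1, len(potentials) - 1)
    ]
```

Applied to each time point of a gamma-filtered depth profile, this produces the sink/source map in which a local generator shows up as alternating sinks and sources with a phase reversal across depth.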

      Reviewer #2 (Public Review):

      This is a very interesting paper, in which the authors describe how respiration-driven gamma oscillations in the piriform cortex are generated. Using a published data set, they find evidence for a feedback loop between local principal cells and feedback interneurons (FBIs) as the main driver of respiration-driven gamma. Interestingly, odour-evoked gamma bursts coincide with the emergence of neuronal assemblies that activate when a given odour is presented. The results argue in favour of a winner-take-all mechanism of assembly generation that has previously been suggested on theoretical grounds.

      We thank the reviewer for his/her work and accurate summary of our results.

      The article is well-written and the claims are justified by the data. Overall, the manuscript provides novel key insights into the generation of gamma oscillations and a potential link to the encoding of sensory input by cell assemblies. I have only minor suggestions for additional analyses that could further strengthen the manuscript:

      We thank the reviewer for the positive appraisal.

      1) The authors' analysis of firing rates of FFIs and FBIs combined with TeLC experiments make a compelling case for respiration-driven gamma being generated in a pyramidal cell-FBI feedback mechanism. This conclusion could be further strengthened by analyzing the gamma phase-coupling of the three neuronal populations investigated. One would expect strong coupling for FBIs but not FFIs (assuming that enough spikes of these populations could be sampled during the respiration-triggered gamma bursts). An additional analysis to strengthen this conclusion could be to extract FBI- and FFI spike-triggered gamma-filtered signals. One might expect an increase in gamma amplitude following FBI but not FFI spiking (see e.g., Pubmed ID 26890123).

      We thank the reviewer for the comment. To address this point, we first computed spike-coupling strength (by means of the Mean Vector Length – MVL) for each neuronal subtype. As shown below, we did not find major differences in MVL values across subtypes (if anything, the FBIs actually displayed the lowest MVL, though it should be cautioned that this metric is sensitive to sample size, which differed among subtypes):

      Of note, this result also translated to spike-triggered gamma-filtered signals, with FBIs having the lowest average. We do not, however, believe these findings speak against a major role of FBIs in giving rise to field gamma, since inhibited neurons are expected to phase-lock strongly to gamma (while neurons that are more active during gamma would show weaker phase-locking). Nevertheless, we also computed the spike-triggered gamma amplitude envelope for all three neuronal subtypes. This analysis showed that gamma envelopes closely followed FBI spikes (and not FFIs or EXC cells), and thus this new result reinforces the idea that FBIs trigger gamma oscillations. This plot is now part of an inset of Figure 1G (described on page 5).
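      The two measures discussed above can be sketched in a few lines. The MVL is the length of the mean unit vector of the spike phases; for randomly distributed phases its expected value shrinks roughly as 1/√n with the number of spikes n, which is why it depends on sample size. The spike-triggered average assumes that `signal` is already the quantity of interest (e.g., a gamma amplitude envelope obtained with a Hilbert transform) sampled at the same indices as the spikes. Both helpers are hypothetical illustrations, not the analysis code.

```python
import cmath
import math


def mean_vector_length(phases):
    """|mean of unit vectors e^{i*phase}|: 1 = perfect locking, near 0 = no locking."""
    if not phases:
        raise ValueError("need at least one spike phase")
    return abs(sum(cmath.exp(1j * ph) for ph in phases) / len(phases))


def spike_triggered_average(signal, spike_indices, half_window):
    """Average signal around each spike over [s - half_window, s + half_window]."""
    width = 2 * half_window + 1
    acc = [0.0] * width
    n = 0
    for s in spike_indices:
        if s - half_window < 0 or s + half_window >= len(signal):
            continue  # skip spikes whose window would fall off the recording
        for k in range(width):
            acc[k] += signal[s - half_window + k]
        n += 1
    if n == 0:
        raise ValueError("no spikes with a complete window")
    return [a / n for a in acc]
```

Comparing the post-spike part of the averaged envelope across FBIs, FFIs and excitatory cells corresponds to the comparison described above.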

      2) The authors utilize the neurons' weight in the first PC to assign them to odour-related assemblies. This method convincingly extracts an assembly for each odour (when odours are used individually), and these seem to be virtually non-overlapping. It would be informative to test whether a similar clear separation of the individual assemblies could be achieved by running the analysis on all odours simultaneously, perhaps by employing an assembly extraction procedure that handles overlapping assembly membership better than a pure PCA approach (as used for instance in the work cited on page 11, including the authors' previous work)? I do not doubt the validity of the authors' approach here at all, but the suggested additional analysis might allow the authors to increase their confidence that individual neurons contribute mostly to an assembly related to a single odour.

      We thank the reviewer for the pertinent comment. In order to address it, we ran the ICA-based approach to detect cell assemblies (Lopes-dos-Santos et al., 2013) using the spike time series of all odors concatenated. The concatenation included time windows around the gamma peak (100-400 ms after inhalation start). We chose this window to prevent the ICA from picking temporal features of the response as different ICs instead of the spiking variations caused by the different odors. As a reference, we also calculated ICA for each odor independently during the gamma peak.
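      The ICA-based procedure of Lopes-dos-Santos et al. (2013) starts by z-scoring each neuron's binned spike counts and counting how many eigenvalues of the neuron-by-neuron correlation matrix exceed the upper Marchenko-Pastur bound (1 + √(N/T))², where N is the number of neurons and T the number of time bins; that count sets how many ICs to extract. The sketch below covers only this first step and takes the eigenvalues as given (in practice they would come from a routine such as numpy.linalg.eigvalsh, and the ICA itself from, e.g., scikit-learn's FastICA). The helper names are hypothetical.

```python
import math


def zscore(row):
    """Z-score one neuron's binned spike counts (population standard deviation)."""
    n = len(row)
    mean = sum(row) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in row) / n)
    if sd == 0:
        raise ValueError("a constant spike-count row cannot be z-scored")
    return [(x - mean) / sd for x in row]


def marchenko_pastur_upper(n_neurons, n_bins):
    """Upper edge of the eigenvalue spectrum expected for independent neurons."""
    return (1.0 + math.sqrt(n_neurons / n_bins)) ** 2


def count_assemblies(eigenvalues, n_neurons, n_bins):
    """Number of correlation-matrix eigenvalues above the MP bound."""
    lam_max = marchenko_pastur_upper(n_neurons, n_bins)
    return sum(1 for lam in eigenvalues if lam > lam_max)
```

With concatenated odor responses, T grows with the number of odors included, which lowers the bound and makes weaker assemblies detectable.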

      We found that the results obtained from ICA computed using concatenated data from all odors show important resemblances to those from the single ICA per odor approach. For instance, we get similar sparsity and cell assembly membership (Figure 6-figure supplement 1A), orthogonality (Figure 6-figure supplement 1B), and odor specificity (Figure 6-figure supplement 1C) in the ICs loadings through both approaches. Notably, the average absolute IC correlation between the six odors (computed separately) and the first six ICs (computed from the combined odor responses) was similar across animals and showed no significant differences (Figure 6-figure supplement 1C).

      We also directly tested odor selectivity and separation in the concatenated data approach by computing each odor’s mean assembly activity (i.e., “IC projection”). Regarding the former, we found that most assemblies coded for 1 or 2 odors (Figure 6-figure supplement 1D). Regarding the diversity of representations for the sampled neurons, we assessed odor separation by examining which odor activates each IC the most. Under this framework, we find that, on average, the first six ICs encode three to five different odors (Figure 6-figure supplement 1E).
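      In the Lopes-dos-Santos framework, assembly activity (the IC projection) is commonly computed as R(t) = z(t)ᵀ P z(t), where z(t) is the vector of z-scored spike counts in time bin t and P = w wᵀ is the outer product of the IC weights with its diagonal set to zero, so that a single highly active neuron cannot drive the assembly on its own. The sketch below computes R for one time bin and then finds, for each IC, the odor with the highest mean activity; the per-odor averaging and all helper names are illustrative assumptions.

```python
def assembly_activity(weights, z):
    """R = z^T P z with P = w w^T and a zeroed diagonal (single-neuron terms excluded)."""
    n = len(weights)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += z[i] * weights[i] * weights[j] * z[j]
    return total


def preferred_odor(mean_activity_by_odor):
    """Odor whose trials give the highest mean activity for one assembly."""
    return max(mean_activity_by_odor, key=mean_activity_by_odor.get)


def n_distinct_preferred(profiles):
    """How many different odors are the preferred odor of at least one IC."""
    return len({preferred_odor(p) for p in profiles})
```

Counting distinct preferred odors over the first six ICs gives the kind of odor-separation summary reported above.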

      We have included this result as a new Figure 6-figure supplement 1 and mention it on page 8. Of note, we have also performed all of our previous assembly analyses (i.e., Figure 6) using ICA instead of PCA to be consistent throughout the manuscript and allow the reader to compare with the new supplementary figure. This led to a new and enhanced version of Figure 6.

      3) Do the authors observe a slow drift in assembly membership as predicted from previous work showing slowly changing odour responses of principal neurons (Schoonover et al., 2021)? This could perhaps be quantified by looking at the expression strengths of assemblies at individual odour presentations or by running the PCA separately on the first and last third of the odour presentations to test whether the same neurons are still 'winners'.

      We thank the reviewer for calling our attention to this point. We note, however, that the representation drift observed by Schoonover et al. occurred along several days of recordings, i.e., at a much slower time scale than the single-day recordings we analyzed here (of note, Schoonover et al. observed no drift within the same day [their Fig 2a]). But irrespective of this, we believe that the data at hand does not allow for a confident analysis of possible drifts. This is because each odor was only presented ~12 times; so, further subdividing the data into subsets of only 4 trials would not render a reliable analysis, unfortunately.

      4) Does the winner-take-all scenario involve the recruitment of specific sets of FBIs during the activation of the individual odour-selective assemblies? The authors could address this by testing whether the rate of FBIs changes differently with the activation of the extracted assemblies.

      Within each recording session, the number of recorded FBIs is very low, on average 3.6 FBIs per recording session. Thus, unfortunately, such an analysis cannot be performed with confidence.

      5) Given the dependence on local gamma oscillations, one might expect that odour-selective assemblies do not emerge in the TeLC-expressing hemisphere. This could be directly tested in the existing data set.

      We are thankful for the comment. We followed the reviewer's suggestion and analyzed odor assemblies in TeLC recordings, comparing the ipsilateral hemisphere (infected) with the contralateral one. Interestingly, we find an increased correlation of assembly weights across odors, suggesting that the formation/segregation of odor-selective assemblies is hindered when the principal cell synapses are abolished. This reduction in assembly selectivity co-occurred with a decrease in the number of losing neurons, whose inhibition was also reduced. Consequently, decoding accuracy significantly decreased during the 150-250 ms window in the infected TeLC hemisphere compared to the contralateral cortex.
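      The correlation of assembly weights across odors can be summarized, for instance, as the mean absolute Pearson correlation over all pairs of per-odor weight vectors; higher values indicate less segregated assemblies. The use of absolute values and of a simple pairwise mean are assumptions made for this sketch, and the helper names are hypothetical.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation of two equal-length weight vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    if sa == 0 or sb == 0:
        raise ValueError("weight vectors must not be constant")
    return cov / (sa * sb)


def mean_abs_weight_correlation(weight_vectors):
    """Mean |r| over all pairs of per-odor assembly weight vectors."""
    pairs = list(combinations(weight_vectors, 2))
    return sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs)
```

Computing this separately for the ipsilateral and contralateral hemispheres of each animal would give the paired comparison described above.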

      Therefore, we believe these new results support the role of gamma oscillations in segregating cell assemblies and generating a sparse orthogonal odor representation. These findings are now included as new panels of Figure 6 and Figure 7 and discussed on page 8.

    1. Author Response

      Reviewer #1 (Public Review):

      By studying the effect of Treg depletion in a CD8+ T cell-dependent diabetes model the group around Ondrej Stepanek described that in the absence of Treg cells antigen-specific CD8+ OT-I T cells show an activated phenotype and accelerate the development of diabetes in mice. These cells - termed KILR cells - express CD8+ effector and NK cell gene signatures and are identified as CD49d- KLRK1+ CD127+ CD8+ T cells. The authors suggest that the generation of these cells is dependent on TCR stimulation and IL-2 signals, either provided due to the absence of Treg cells or by injection of IL-2 complexed to specific anti-IL-2 mAbs. In vivo, these cells show improved target cell killing properties, while the authors report improved anti-tumor responses of combination treatments with doxorubicin combined with IL-2/JES6 complexes. Finally, the authors identified a similar human subset in publicly available scRNAseq datasets, supporting the translational potential of their findings.

      The conclusions are mostly well supported, except for the following two considerations:

      We are happy for the positive overall evaluation of our manuscript by both reviewers and we are thankful for their specific insightful comments, which helped us to improve the manuscript.

      1) From Fig. 4A and B it is not conclusively shown, that Tregs limit IL-2 necessary for the expansion of OT-I cells and subsequent induction of diabetes. An IL-2 depletion experiment (e.g. with combined injection of the S4B6 and JES6-1 antibodies) would further strengthen this claim. Along these lines, the authors claim "IL-2Rα expression on T cells can be induced by antigen stimulation or by IL-2 itself in a positive feedback loop [20]. Accordingly, downregulation of IL-2Rα in OT-I T cells in the presence of Tregs might be a consequence of the limited availability of IL-2.". The cited reference 20 did observe CD25 upregulation by IL-2 on T cells but the observed effect might only be caused by upregulation of CD25 on Treg cells, which increases the MFI for the whole T cell population. Did the authors observe significant upregulation of CD25 on effector CD4+ and CD8+ T cells in their experiments with IL-2/S4B6 or IL-2/JES6 treatment?

      We added another reference to support our claim (Sereti, I., et al., Clin Immunol, 2000. 97(3): p. 266-76.). Along this line, we also observed that the addition of IL-2 in vitro leads to IL-2Rα upregulation on CD8+ T cells (shown in Fig. 4C), and that the IL-2Rα level was lower if Tregs were present. We also observed upregulation of IL-2Rα in vivo upon the stimulation of OT-I T cells with OVA and IL-2ic, which is now shown in Fig. S6C of the revised manuscript.

      To further explore whether Tregs limit the expansion of OT-I T cells and diabetes progression by limiting IL-2, we performed the proposed experiment using a combined injection of S4B6 and JES6-1 anti-IL-2 antibodies. Initially, we were skeptical that we could completely block IL-2 using this approach, for the following reasons. First, IL-2 is produced locally in the spleen and lymph nodes and might not be easily accessible to the antibodies for a complete block. Second, IL-2 has a relatively short turnover and is continuously produced, but the half-life of the injected antibodies is unknown, which questions the duration of such a block. Third, some IL-2 molecules might be bound by only one of the two antibodies, forming a hyper-stimulating immune complex instead of being neutralized.

      Nevertheless, we were curious enough to perform this experiment. We used a condition that, based on our experience, leads to diabetes manifestation in Treg-depleted, but not in Treg-replete mice (10 k OT-I T cells, OVA + LPS immunization). One additional group of Treg-depleted mice received a single dose of S4B6 and JES6-1 anti-IL-2 (200 µg of each antibody per mouse). We observed that this IL-2 blocking delayed, but did not prevent, the development of diabetes in most animals (Fig. 1 below).

      Overall, we believe that this experiment rather supports our conclusions concerning the importance of IL-2, although the effect is only partial. However, we decided not to include this experiment in the manuscript, because we lack evidence of how efficient the IL-2 blocking was (see above), which makes the interpretation difficult. Because the reviews and the point-by-point response are public in eLife, we believe that showing the data here is appropriate.

      Figure 1. Role of IL-2 blocking on the development of experimental diabetes. Two independent experiments were performed. Statistical significance was calculated using Log-rank (Mantel-Cox) test for survival, and Kruskal-Wallis test for blood glucose (p-value is shown in italics).

      2) The anti-tumor efficacy of KILR cells is intriguing but currently, it is unclear if it is indeed mediated by KILR cells. Have KILR cells been identified by flow cytometry in the BCL1 and B16F10 models treated with doxorubicin and IL-2/JES6? Were specific KILR cell depletion studies conducted, e.g. with an anti-KLRK1 depleting antibody? Additional experiments addressing these questions would be desirable to further support the authors' claims.

      We are thankful to both reviewers for their similar comments concerning the analysis of CD8+ T cells in the tumor model. Addressing these comments led to very useful data and significantly improved our manuscript.

      We performed the analysis of splenic CD8+ T cells in the BCL1 leukemia model (spleen is the major site of the leukemic cells in this model). We observed that KLRK1+ T cells represented almost half of CD8+ T cells in mice treated with DOX+IL-2ic, which was a much higher frequency than in the control and DOX-only treated mice. Although not all KLRK1+ cells were bona fide KILR cells, the frequencies of KLRK1+ IL-7R+ and KLRK1+ CD49d- cells were also strongly elevated in the DOX+IL-2ic treated mice. Overall, the survival of DOX+IL-2ic treated mice correlated with the frequencies of KILR T cells and KLRK1+ T cells. Moreover, GZMB was almost exclusively expressed by KLRK1+ T cells. We are showing these data in Fig. 7C and Fig. S7B in the revised manuscript.

      In the B16 melanoma model, we analyzed CD8+ T cells in the spleens and also in the tumors. We observed a huge KLRK1+ GZMB+ CD8+ T-cell population in the spleen of DOX+IL-2ic-treated mice, but not in the untreated or DOX-only treated mice (Fig. 7F). Both KLRK1+ CD49d+ and KLRK1+ CD49d- CD8+ T cells were substantially more frequent in the DOX+IL-2ic-treated, but not in the untreated or DOX-only treated mice (Fig. S7F). In the tumor, the KLRK1+ CD49d- CD8+ T cells were found at large numbers only in the DOX+IL-2ic-treated mice (Fig. 7G). Moreover, these KLRK1+ CD49d- CD8+ T cells expressed high levels of IL-7R and GZMB only in DOX+IL-2ic-treated, but not in untreated and DOX-only treated mice (Fig. 7H).

      We believe that these new data provide evidence that the combination of immunogenic chemotherapy with IL-2 treatment induced KILR cells in the spleens and in the tumors and that this correlates with better survival.

      Because the majority of non-naïve CD8+ T cells (and vast majority of GZMB+ CD8+ T cells) in the spleens and tumors of the tumor-bearing mice treated with DOX+IL-2ic were KLRK1+ and because we have shown that the protective effect of the DOX+IL-2ic therapy is largely CD8+ T cell-dependent, we did not find it essential to perform the depletion of KLRK1+ T-cells. We believe that it is almost inevitable that the depletion of KLRK1+ T cells would lead to increased tumor growth as it would probably deplete the majority of antigen-specific CD8+ T cells, mimicking the overall CD8+ T cell depletion. Moreover, we do not have this protocol established.

      Reviewer #2 (Public Review):

      In this study, the authors determine the superior cell killing abilities of KLRK1+ IL7R+ (KILR) CD8+ effector T cells in experimental diabetes and tumor mouse model. They also provide evidence that Tregs suppress the formation of this previously uncharacterized subset of CD8+ effector T cells by limiting IL-2.

      Strength and Limitation

      This study focuses on the relationship between Tregs and CD8+ T cells. They used different experimental diabetes mouse models to reveal that Tregs suppress the CD8+ effector T cells by limiting IL-2. Through single-cell sequencing, they also found a unique subset of KLRK1+ IL7R+ (KILR) CD8+ effector T cells with superior cell killing abilities, which could be inhibited by Tregs. They also tested their theory in an in vivo tumor model. The data, in general, support the conclusions; however, some issues need to be fully addressed, as detailed below.

      We are happy for the positive overall evaluation of our manuscript by both reviewers and we are thankful for their specific insightful comments, which helped us to improve the manuscript.

      1) This study used the concentration of urine glucose as the standard for diabetes (≥ 1000 mg/dl for two consecutive days). However, multiple reasons may lead to a high level of urine glucose. As this is a type I diabetes mouse model, the authors could use immunohistological analysis of islets to show the proportion of T cells and islet cells in the islets, which would directly display the geographic distribution of immune cells and the severity and histological structure of damaged pancreatic islets. If possible, different subsets of immune cells, especially CD4+ vs CD8+ cells, should be stained for their location.

      We added the histological examination of the pancreas in control, DEREG-, and DEREG+ mice using contrast H&E staining and immunofluorescence (Fig. 1D-E in the revised manuscript). We observed that the high urine and blood glucose levels are preceded by the destruction of the pancreatic islets (morphology and decreased insulin production) as well as by the infiltration of the islets by immune cells, including CD4+ and CD8+ T cells.

      2) This article shows that KILR effector CD8+ T cells have strong cytotoxic properties. However, they do not describe the potential proliferation ability vs apoptosis of this subset from islets.

      We analyzed the proliferation (KI67 expression) and apoptosis (Annexin V, cleaved Caspase 3) in T cells isolated from the pancreas of DEREG- and DEREG+ mice on day 4 after the induction of diabetes using flow cytometry (Figure 2 below). We did not observe any differences between DEREG- and DEREG+ mice or among different subsets of OT-I T cells in the DEREG+ mice. Essentially, all T cells were proliferative (KI67+) and there was a very low percentage of Annexin V or cleaved Caspase 3 positive cells.

      Figure 2. Lymphocytes were isolated from the pancreas of DEREG- RIP.OVA and DEREG+ RIP.OVA mice on day 4 after the induction of diabetes, and analyzed using flow cytometry. Two independent experiments were performed. Gated on OT-I T cells. Top: proliferation rate based on Ki-67 staining. Representative histogram and MFI (median is shown). Middle: Apoptosis rate based on Annexin V staining. Representative histogram shows Annexin V staining in three populations of OT-I T cells from DEREG+ mouse (“AE” - CD49d+ KLRK1-, “++” - CD49d+ KLRK1+, KILR - CD49d- KLRK1+), total OT-I T cells from DEREG-, and a positive control: WT CD8+ T cells treated with hydrogen peroxide. Middle right: Percentage of Annexin V+ cells and MFI (median is shown). Bottom: Apoptosis rate based on cleaved Caspase 3 staining. Representative dot plots show cleaved Caspase 3 staining of OT-I T cells from DEREG+, DEREG-, and a positive control: WT CD8+ T cells treated with hydrogen peroxide. Bottom right: percentage of cleaved Caspase 3+ cells (median is shown).

      However, we found the question concerning the proliferation and apoptosis of KILR cells interesting and worth further investigation. For this reason, we assessed the proliferation, survival, and phenotypic stability of naïve, KILR, and effector T cells by their competitive transfer into CD3ε-/- mice. The phenotype of all three subsets remained stable for 4 days (Fig. 6F), documenting that KILR cells are not just a very transient stage. Moreover, the KILR cells were ~2-fold more abundant than effector cells 3 days after their 1:1 cotransfer into CD3ε-/- mice (Fig. 6G, Fig. 6SE). This was probably caused by their slight advantages in both proliferation and survival (Fig. 6SF-G).

      3) Figure 7 shows that the antitumor efficacy of IL-2 depends on CD8+ T cells. But in this part, there is no data to show the change of KLRK1+ IL7R+ CD8+ effector T cells in tumor tissue. Therefore, the article needs to add more data to verify that IL-2 enhances antitumor ability via KLRK1+ IL7R+ CD8+ effector T cells.

      We are thankful to both reviewers for their similar comments concerning the analysis of CD8+ T cells in the tumor model. Addressing these comments led to very useful data and significantly improved our manuscript.

      We performed the analysis of splenic CD8+ T cells in the BCL1 leukemia model (spleen is the major site of the leukemic cells in this model). We observed that KLRK1+ T cells represented almost half of CD8+ T cells in mice treated with DOX+IL-2ic, which was a much higher frequency than in the control and DOX-only treated mice. Although not all KLRK1+ cells were bona fide KILR cells, the frequencies of KLRK1+ IL-7R+ and KLRK1+ CD49d- cells were also strongly elevated in the DOX+IL-2ic treated mice. Overall, the survival of DOX+IL-2ic treated mice correlated with the frequencies of KILR T cells and KLRK1+ T cells. Moreover, GZMB was almost exclusively expressed by KLRK1+ T cells. We are showing these data in Fig. 7C and Fig. S7B in the revised manuscript.

      In the B16 melanoma model, we analyzed CD8+ T cells in the spleens and also in the tumors. We observed a huge KLRK1+ GZMB+ CD8+ T-cell population in the spleen of DOX+IL-2ic-treated mice, but not in the untreated or DOX-only treated mice (Fig. 7F). Both KLRK1+ CD49d+ and KLRK1+ CD49d- CD8+ T cells were substantially more frequent in the DOX+IL-2ic-treated, but not in the untreated or DOX-only treated mice (Fig. S7F). In the tumor, the KLRK1+ CD49d- CD8+ T cells were found at large numbers only in the DOX+IL-2ic-treated mice (Fig. 7G). Moreover, these KLRK1+ CD49d- CD8+ T cells expressed high levels of IL-7R and GZMB only in DOX+IL-2ic-treated, but not in untreated and DOX-only treated mice (Fig. 7H).

      We believe that these new data provide evidence that the combination of immunogenic chemotherapy with IL-2 treatment induced KILR cells in the spleens and in the tumors and that this correlates with better survival.

      4) It is unclear why the authors chose Dox to combine with IL-2/JES6. The authors should provide a more rational introduction to bridge such a combination. Authors should also explain the reason why there is no antitumor effect of IL-2/JES6 treatment alone.

      The experiments with OT-I mice showed that the formation of KILR cells required both antigenic stimulation and IL-2 signals. We believe that the tumor itself provides only very weak antigenic stimulation. For this reason, we combined the treatment with the chemotherapeutic doxorubicin, which is known to induce immunogenic cell death of tumor cells (e.g., Casares et al. 2005, PMID: 16365148). We believe that doxorubicin induces the death of (some) tumor cells and the release and presentation of their tumor-specific antigens. Without it, the tumors are simply too “cold” to induce a sufficient T-cell response. We emphasized this in the revised version of the manuscript.

      Importantly, some of us observed a similar effect of IL-2ic in a combination with check-point blockade therapy (without chemotherapy) in a different tumor model, which documents that the chemotherapy is not essential for this effect (unpublished data).

    1. Author Response

      Reviewer #1 (Public Review):

      Point 1: Many of the initial analyses of behavior metrics, for instance predicting reaction times, number of fixations, or fixation duration, use value difference as a regressor. However, given a limited set of values, value differences are highly correlated with the option values themselves, as well as the chosen value. For instance, in this task the only time when there will be a value difference of 4 drops is when the options are 1 and 5 drops, and given the high performance of these monkeys, this means the chosen value will overwhelmingly be 5 drops. Likewise, there are only two combinations that can yield a value difference of 3 (5 vs. 2 and 4 vs 1), and each will have relatively high chosen values. Given that value motivates behavior and attracts attention, it may be that some of the putative effects of choice difficulty are actually driven by value.
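      The reviewer's argument can be made concrete with a small enumeration. The sketch assumes reward sizes of 1 to 5 drops, two distinct values per trial, and that the higher-valued option is always chosen (an idealization of the monkeys' high accuracy); these are assumptions for illustration, not a description of the actual trial set.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


values = [1, 2, 3, 4, 5]               # assumed reward sizes, in drops
pairs = list(combinations(values, 2))  # assumed: two distinct values per trial, a < b
difference = [b - a for a, b in pairs]  # value difference (larger = easier trial)
chosen = [b for _, b in pairs]          # assumed: the larger option is always chosen

by_difference = {
    d: [p for p, dd in zip(pairs, difference) if dd == d] for d in sorted(set(difference))
}
r = pearson(difference, chosen)

print(by_difference[4])  # [(1, 5)]
print(by_difference[3])  # [(1, 4), (2, 5)]
print(round(r, 3))       # 0.5
```

Even in this idealized design, value difference and chosen value share a correlation of 0.5, so a regressor built on one partly absorbs effects of the other.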

      To address this question, we have adapted the methods of Balewski and colleagues (Neuron, 2022) to isolate the unique contributions of chosen value and trial difficulty to reaction time and the number of fixations in a given trial (the two behaviors modulated by difficulty in the original paper). This new analysis reveals a double dissociation in which reaction time decreases as a function of chosen value but not difficulty, while the number of fixations in a trial shows the opposite pattern. Our interpretation is that reaction time largely reflects reward anticipation, whereas the number of fixations largely reflects the amount of information required to render a decision (i.e., choice difficulty). See lines 144-167 and Figure 2.

      Point 2: Related to point 1, the study found that the duration of first fixations increased with fixated value, and that second (middle) fixation durations decreased with fixated value but increased with the relative value of the fixated versus the other option. Can this effect be more concisely described as an effect of the value of the first fixated option carrying over into behavior during the second fixation?

      This is a valid interpretation of the results. To test it directly, we now include an analysis of middle fixation duration as a function of the value of the not-currently-viewed target. Note that the vast majority of middle fixations are the second fixation in the trial, and therefore the unattended target is typically the one that was viewed first. The analysis showed a negative correlation between middle fixation duration and the value of the unattended target, which is consistent with the first fixated value carrying over to the second fixation. See lines 243-246.

      Point 3: Given that chosen (and therefore anticipated) values can motivate responses, often measured as faster reaction times or more vigorous motor movements, it seems curious that terminal non-decision times were calculated as a single value for all trials. Shouldn't this vary depending at least on chosen values, and perhaps other variables in the trial?

      In all sequential sampling model formulations we are aware of, non-decision time is considered to be fixed across trial types. Examples can be found for perceptual decisions (e.g., Resulaj et al., 2009) and in the “bifurcation point” approach used in the recent value-based decision study by Westbrook et al. (2020).

      To further investigate this issue, we asked whether other post-decision processes were sensitive to chosen value in our paradigm. To do so, we measured the interval between the center lever lift and the left or right lever press, corresponding to the time taken to perform the reach movement in each trial (reach latency). We then fit a mixed effects model explaining reach latency as a function of chosen value. While the results showed significantly faster reach latencies with higher chosen values, the effect size was very small, showing on average a ~3ms decrease per drop of juice. In other words, between the highest and lowest levels of chosen value (5 vs. 1), there is only a difference of approximately 12ms. In contrast, the main RT measure used in the study (the interval between target onset and center lever lift) is an order of magnitude more sensitive to chosen value, decreasing ~40ms per drop of juice. These results are shown in Author response image 1.
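      As a quick check of the effect sizes quoted above, assuming the per-drop slopes are linear across the 1 to 5 drop range (the 160ms figure is extrapolated here, not reported in the response):

```latex
\begin{aligned}
\Delta_{\text{reach}} &\approx 3\,\tfrac{\text{ms}}{\text{drop}} \times (5-1)\,\text{drops} = 12\,\text{ms} \\
\Delta_{\text{RT}} &\approx 40\,\tfrac{\text{ms}}{\text{drop}} \times (5-1)\,\text{drops} = 160\,\text{ms}
\end{aligned}
```

      The ratio of the two slopes (40/3, roughly 13) is what the response summarizes as an order of magnitude.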

      Author response image 1.

      This suggests that post-decision processes (NDT in standard models and the additive stage in the Westbrook paper) vary only minimally as a function of chosen value. We are happy to include this analysis as a supplemental figure upon request.

      Point 4: The paper aims to demonstrate similarities between monkey and human gaze behavior in value-based decisions, but focuses mainly on a series of results from one group of collaborators (Krajbich, Rangel and colleagues). Other labs have shown additional nuance that the present data could potentially speak to. First, Cavanagh et al. (J Exp Psychol Gen, 2014) found that gaze allocation and value differences between options independently influence drift rates on different choices. Second, gaze can correlate with choice because attention to an option amplifies its value (or enhances the accumulation of value evidence) or because chosen options are attended more after the choice is implicitly determined but not yet registered. Westbrook et al. (Science, 2020) found that these effects can be dissociated, with attention influencing choice early in the trial and choice influencing attention later. The NDTs calculated in the present study allot a consistent time to translating a choice into a motor command, but as noted above don't account for potential influences of choice or value on gaze.

      The two-stage model of gaze effects put forth by Westbrook et al. (2020) is consistent with other observations of gaze behavior and choice (e.g., Thomas et al., 2019, Smith et al., 2018, Manohar & Husain, 2013). In this model, gaze effects early in the trial are best described by a multiplicative relationship between gaze and value, whereas gaze effects later in the trial are best described by an additive model term. To test the two-stage hypothesis, Westbrook and colleagues determined a ‘bifurcation point’ for each subject that represented the time at which gaze effects transitioned from multiplicative to additive. In our data, trial durations were typically very short (<1s), making it difficult to divide trials and fit separate models to each part. We therefore took a different approach. We reasoned that if gaze effects transition from multiplicative to additive at the end of the trial, then the transition point could be estimated by removing data from the end of each trial and assessing the relative fit of a multiplicative vs. an additive model. If early gaze effects are predominantly multiplicative and late gaze effects are additive, the relative goodness of fit of the additive model should decrease as more data are removed from the end of the trial. To test this idea, we compared the relative fit of additive vs. multiplicative models in the raw data, and in data from which successively larger epochs were removed from the end of the trial (50, 100, 150, 200, 300, and 400ms). The relative fit was assessed by computing the relative probability that each model accurately reflects the data. In addition, to identify significant differences in goodness of fit, we compared the WAIC values and their standard errors for each model (Supplemental File 3).

      As shown in Figure 4, the relative fit probability for both models is nonzero in the raw data (0ms truncation), indicating that neither model provides a definitive best fit, potentially reflecting a mixture of the two processes. However, the relative fit of the additive model decreases sharply as data are removed, reaching zero at 100ms truncation. 100ms is also the point at which the multiplicative model provides a significantly better fit, indicated by non-overlapping standard error intervals for the two models (Supplemental File 3). Together, these results suggest that the transition between early- and late-stage gaze effects likely occurs approximately 100ms before the RT.
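      The response does not state how the relative model probabilities were computed. One common choice, assumed in this sketch, is Akaike-style weights computed from WAIC values on the deviance scale (lower is better); ArviZ's `compare` function offers related pseudo-BMA and stacking weights. The WAIC numbers in the usage lines are hypothetical.

```python
import math


def waic_weights(waics):
    """Relative model probabilities from WAIC values on the deviance scale
    (lower WAIC = better fit), using Akaike-style weights. Sums to 1."""
    best = min(waics)
    # exp(-0.5 * delta) turns a deviance difference into a relative likelihood.
    rel = [math.exp(-0.5 * (w - best)) for w in waics]
    total = sum(rel)
    return [x / total for x in rel]


# Hypothetical WAIC values for (additive, multiplicative) models.
print(waic_weights([1000.0, 1000.0]))  # equal fit -> [0.5, 0.5]
print(waic_weights([1020.0, 1000.0]))  # additive weight close to 0
```

      Repeating this for each truncation value gives a curve of additive-model weight against truncation length, which is the kind of summary the response describes for Figure 4.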

      To minimize the influence of post-decision gaze effects, the main results use data truncated by 100ms. However, because 100ms is only an estimate, we repeated the main analyses over truncation values between 0 and 400ms, reported in Figure 6 - figure supplement 1 & Figure 7 - figure supplement 1. These show significant gaze duration biases and final gaze biases in data truncated by up to 200ms.
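      A minimal sketch of the truncation step, assuming each trial is stored as a list of fixations with onset and offset times relative to target onset, and that truncation is measured back from the RT (lever lift). The `Fixation` record, the helper names, and the example trial are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Fixation:
    target: str      # e.g. "left" or "right"
    start_ms: float  # onset relative to target onset
    end_ms: float    # offset relative to target onset


def truncate_trial(fixations, rt_ms, trunc_ms):
    """Remove the final `trunc_ms` before the RT from a trial's fixations.

    Fixations that begin at or after the cutoff are dropped; a fixation that
    straddles the cutoff is clipped so that it ends at the cutoff. The input
    list is not modified."""
    cutoff = rt_ms - trunc_ms
    kept = []
    for f in fixations:
        if f.start_ms >= cutoff:
            continue  # lies entirely inside the discarded terminal epoch
        kept.append(Fixation(f.target, f.start_ms, min(f.end_ms, cutoff)))
    return kept


def gaze_time(fixations, target):
    """Total dwell time on one target, e.g. as input to a gaze-duration-bias analysis."""
    return sum(f.end_ms - f.start_ms for f in fixations if f.target == target)


# Hypothetical trial: left 200-450 ms, right 450-700 ms, RT = 700 ms.
trial = [Fixation("left", 200, 450), Fixation("right", 450, 700)]
for trunc in (0, 50, 100, 150, 200, 300, 400):
    kept = truncate_trial(trial, rt_ms=700, trunc_ms=trunc)
    print(trunc, gaze_time(kept, "left"), gaze_time(kept, "right"))
```

      Clipping rather than dropping the straddling fixation keeps the pre-cutoff dwell time, which is what a gaze-duration bias measure would need.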

      Reviewer #2 (Public Review):

      Recommendation 1: The only real issue that I see with the paper is fairly obvious: the authors find that the last fixations are longer than the rest, which is inconsistent with a lot of the human work. They argue that this is due to the reaching required in this task, and they take a somewhat ad-hoc approach to trying to correct for it. Specifically, they take the difference between final and non-final, second fixations, and then choose the 95th percentile of that distribution as the amount of time to subtract from the end of each trial. This amounts to about 200 ms being removed from the end of each trial. There are several issues with this approach. First, it assumes that final and non-final fixations should be the same length, when we know from other work that final fixations are generally shorter. Second, it seems to assume that this 200ms is "the latency between the time that the subject commits to the movement and the time that the movement is actually detected by the experimenter". However, there is a mismatch between that explanation and the details of the task. Those last 200ms are before the monkey releases the middle lever, not before the monkey makes a left/right choice. When the monkey releases the middle lever, the stimuli disappear and they then have 500ms to press the left or right lever. But, the reaction time and fixation data terminate when the monkey releases the middle lever. Consequently, I don't find it very likely that the monkeys are using those last 200ms to plan their hand movement after releasing the middle lever.

      Thanks for the opportunity to clarify these points. There are three related issues:

      First, with regard to fixation durations, in the updated Figure 3 we now show durations as a function of both the absolute order in the trial (first, second, third, fourth, etc.) and the relative order (final/non-final). We find that durations decrease as a function of absolute order in the trial, an effect also seen in humans (see Manohar & Husain, 2013). At the same time, while holding absolute order constant, final fixations are longer than non-final fixations. To explain the discrepancy with human final fixation durations, we note that monkeys make far fewer fixations per trial (~2.5) than humans do (~3.7, computed from publicly available data from Krajbich et al., 2010). This means that compared to humans, monkeys’ final fixations occur earlier in the trial (e.g., second or third), and are therefore comparatively longer in duration. Note that studies with humans have not independently measured fixation durations by absolute and relative order, and therefore would not have detected the potential interaction between the two effects.

      Second, the comment suggests that the final 200ms before lever lift is not spent planning the left/right movement, given that the monkeys have time after the lever lift in which to execute the movement (400 or 500ms, depending on the monkey). The presumption appears to be that 400/500ms should be sufficient to plan a left/right reach. However, we think this is unlikely, and that our original interpretation is the most plausible. First, the 400/500ms deadline between lift and left/right press was set to encourage the monkeys to complete the reach as fast as possible, to minimize deliberation or changes of mind after lifting the lever. More specifically, these deadlines were designed so that on ~0.5% of trials the monkeys fail to complete the reach within the deadline and do not obtain a reward. This manipulation was effective at motivating fast reaches: the average reach latency (time between lift and press) was 165ms (SEM 20ms) for Monkey K and 290ms (SEM 100ms) for Monkey C.

      Therefore, given the time pressure imposed by the task, it is very unlikely that significant reach planning occurs after the lever lift. In addition to these empirical considerations, the idea that the final moments before the RT are used for motor planning is a standard assumption in many theoretical models of choice (including sequential sampling models, see Ratcliff & McKoon 2008, for review), and is also well-supported by studies of motor control and motor system neurophysiology. Based on these, we think the assumption of some form of terminal NDT is warranted.

      Third, we have changed our method for estimating the NDT interval. In brief, we swept through a range of NDT truncation values (0-400ms) and identified the smallest interval (100ms) that minimizes the contribution of “additive” gaze effects, which are thought to reflect late-stage, post-decision gaze processes. See the response to Point 4 from Reviewer 1 above, Figure 4, and lines 267-325 in the main text. In addition, we report all of the major study results over a range of truncation values between 0 and 400ms.

    1. Author Response

      Reviewer #1 (Public Review):

      This paper describes neural activity, measured by intrinsic optical imaging during reach-to-grasp and reach-only conditions, in relation to intracortical microstimulation maps. The paper mostly describes a relatively unique and potentially useful data set. However, in the current version, no real hypotheses about the organization of M1 and PMd are tested convincingly. For example, the claim of "clustered neural activity" is not tested against any quantifiable alternative hypothesis of non-clustered activity, and support for this idea is therefore incomplete.

      The combination of intrinsic optical imaging and intra-cortical micro-stimulation of the motor system of two macaque monkeys promised to be a unique and highly interesting dataset. The experiments are carefully conducted. In the analysis and interpretation of the results, however, the paper was disappointing to me. The two main weaknesses in my mind were:

      a) The alternative hypotheses depicted in Figure 1B are not subjected to any quantifiable test. When is activity considered to be clustered and when is it distributed? The fact that the observed actions only activate a small portion of the forelimb area (Figure 5G, H) is utterly unconvincing, as this analysis is highly threshold-dependent. Furthermore, it could be the case that the non-activated regions simply do not give a good intrinsic signal, as they are close to microvasculature (something that you actually seem to argue in Figure 6b). Until the authors can show that the other parts of the forelimb area are clearly activated for other forelimb actions (as you suggest on line 625), I believe the claim of clustered neural activity stands unsupported.

      We appreciate the reviewer’s concerns and we have made several revisions.

      (1) The two panels in Fig 1B should have been presented as potential outcomes as opposed to hypotheses in need of quantifiable testing. We revised the Introduction (lines 105-111) and the Results (lines 149-152) accordingly.

      (2) We agree that the thresholding procedure adopted in the original submission could have impacted the spatial measurements of cortical activity (i.e., Fig 5G-H in original submission). We have completely revised the thresholding procedure and it is now based on statistical comparisons that include all trials (instead of thresholding by number of sessions in the original submission). Thus, the thresholded maps in Fig 5G & 5J are now obtained from pixel-by-pixel comparisons (t-tests, p<1e-4) between frames acquired post-movement and frames acquired before movement. Nevertheless, even with this relatively relaxed threshold, the largest activity maps overlapped <40% of the forelimb representations.

      It is important to note that major vessels were excluded from the thresholded map and from the motor map. Thus, uncertainty about imaging in and around vessels was likely not a factor in the calculated overlap between thresholded maps and the motor map.
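      The thresholding and overlap measurement can be sketched with NumPy and SciPy. Assumptions not stated in the response: one pre-movement and one post-movement frame (or frame average) per trial, a paired test across trials, only reflectance decreases counted as activity, and boolean masks for the forelimb representation and the major vessels. The synthetic data in the usage lines are invented for illustration.

```python
import numpy as np
from scipy import stats


def threshold_map(pre, post, alpha=1e-4):
    """Binarize an activity map with pixel-by-pixel paired t-tests.

    pre, post: arrays of shape (n_trials, height, width), one frame (or frame
    average) per trial before and after movement. ISOI activity appears as a
    reflectance decrease, so only significant negative changes are kept.
    No correction for multiple comparisons is applied."""
    t, p = stats.ttest_rel(post, pre, axis=0)
    return (p < alpha) & (t < 0)


def overlap_fraction(active, forelimb, vessels):
    """Fraction of the vessel-free forelimb representation covered by active pixels."""
    usable = forelimb & ~vessels
    return (active & usable).sum() / usable.sum()


# Synthetic example: 40 trials, 20 x 20 pixels, focal darkening in one quadrant.
rng = np.random.default_rng(0)
pre = rng.normal(0.0, 1.0, size=(40, 20, 20))
post = rng.normal(0.0, 1.0, size=(40, 20, 20))
post[:, :10, :10] -= 3.0

active = threshold_map(pre, post)
forelimb = np.ones((20, 20), dtype=bool)  # hypothetical ICMS-defined forelimb mask
vessels = np.zeros((20, 20), dtype=bool)  # hypothetical major-vessel mask
print(overlap_fraction(active, forelimb, vessels))  # close to 0.25
```

      Excluding vessel pixels from both the numerator and the denominator mirrors the point above that vessels were removed from both the thresholded map and the motor map.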

      (3) We agree that showing activation in other parts of the forelimb representations in response to actions other than reach-to-grasp would have supported some of the arguments that we previously put forth. Unfortunately, we do not have the supporting data, and obtaining them would take months or years. We have therefore expanded the Discussion to include limitations of the behavioral task (lines 439-443).

      b) The most interesting part of the study (which cannot be easily replicated with human fMRI studies) is the correspondence between the evoked activity and intra-cortical stimulation maps. However, this is impeded by the subjective and low-dimensional description of the evoked movement during stimulation (mainly classifying the moving body part), and the relatively low-dimensional nature (4 conditions) of the evoked activity.

      We agree with the reviewer on all counts. We have expanded the Discussion to consider the low dimensionality of the motor maps and the behavioral task (lines 439-449).

      Measuring cortical activity in a variety of motor tasks would likely have provided additional insight about movement-related cortical activity. However, including additional tasks, even if it were possible to do so in the same monkeys, would have delayed study completion by months or years. The hidden challenge of the experimental design is that each monkey is trained to remain still for many seconds to minimize contamination of ISOI signals. For example, from trial initiation to Go Cue, the monkey must hold its hand in the start position for 5 seconds. Similarly, after movement completion, the monkey must hold its hand in the start position for another 5 seconds. Between successful trials, a monkey must wait for ~12 seconds before it can initiate a new trial. These durations are more than an order of magnitude longer than in electrophysiological studies of comparable tasks. Achieving consistent task performance with the long durations used here took months of daily training. Moreover, our monkeys typically run out of steam after ~60-70 min of working on the task. This forces us to limit the number of task conditions tested per session so that we obtain enough trials from each condition.

      c) Many details about the statistical analysis remain unclear and seem not well motivated.

      We address the reviewer’s specific concerns.

      Reviewer #2 (Public Review):

      Chehade and Gharbawie investigated motor and premotor cortex in macaque monkeys performing grasping and reaching tasks. They used intrinsic signal optical imaging (ISOI) covering an exceedingly large field-of-view extending from the IPS to the PS. They compared reaching and fine/power-grip grasping ISOI maps with "motor" maps which they obtained using extensive intracortical microstimulation. The grasping/reaching-induced activity activated relatively isolated portions of M1 and PMd, and did not cover the entire ICM-induced 'motor' maps of the upper limbs. The authors suggest that small subzones exist in M1 and PMd that are preferentially activated by different types of forelimb actions. In general, the authors address an important topic. The results are highly relevant not only for increasing our basic understanding of the functional architecture of the motor-premotor cortex and how it represents different types of forelimb actions, but also for the development of brain-machine interfaces. These are challenging experiments to perform, and they complement the existing electrophysiology, fMRI, and optical imaging experiments that have been performed on this topic, owing to the high sensitivity and large coverage of the particular ISOI methods employed by the authors. The manuscript is generally well written and the analyses seem overall adequate (but see below for some additional analyses that should be done). Although I'm generally enthusiastic about this manuscript, there are two major issues that should be clarified. These major questions relate mainly to potential thresholding issues and clustering issues.

      Major:

      1) The main claim of the authors is that specific forelimb actions activate only a small fraction of what they call the motor map (i.e., those parts of M1/PMd that evoke muscle contractions upon ICM). The action-related activity is measured by ISOI. When looking at the 'raw' reflectance maps, it is rather clear that relatively wide portions of the exposed cortex are activated by grasping/reaching, especially at later time points after the action. In fact, another reading of the results may be that there are two zones of 'deactivation' that split a large swath of motor-premotor cortex being activated by the grasping/reaching actions (e.g., at 6 seconds after the cue in Fig 3A, 5A). At first sight, the 'deactivated' regions seem to be located in the cortex representing the trunk/shoulder/face, hence regions not necessarily activated (or only weakly) during the grasping/reaching actions. If true, this means that most of the relevant M1/PMd cortex IS activated during the latter actions, which would contradict the 'clustering' claims of the authors. This raises the question of whether the 'granularity' claimed by the authors is

      a. threshold dependent. In this context, the authors should provide an analysis whereby 'granularity' is shown independent of statistical thresholds of the ISOI maps.

      We appreciate the reviewer’s concerns and have completely revised the analyses central to Fig 5. We believe that the figure now contains evidence from both thresholded and unthresholded ISOI data in support of limited spatial extent of cortical activation (i.e., “granularity” in the reviewer’s comments).

      For evidence from unthresholded ISOI data, we examined reflectance change time courses from ROIs of different sizes (lines 764-768): (A) small circular ROIs (0.4 mm radius), which we placed in the M1 hand, M1 arm, and PMd arm zones (Fig 5B); and (B) a large ROI inclusive of the M1 and PMd forelimb representations (Fig 5B). We reasoned that if cortical activity is spatially widespread, then the small and large ROIs would report similar time courses. In contrast, if cortical activity is spatially focal, then activity would be detected in the small ROI time courses but would be washed out in the large ROI time courses. Our results support the second possibility (Fig 5C-F). Specifically, in the movement conditions, time courses from the small ROIs had a large negative peak after movement completion (Fig 5C-E). In contrast, the characteristic negative peak was absent in the time courses obtained from the large ROI (Fig 5F).
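      The logic of the small-versus-large ROI comparison can be illustrated with a synthetic movie. The pixel size, field size, ROI center, and time course below are hypothetical; only the 0.4 mm ROI radius comes from the response.

```python
import numpy as np


def circular_roi(shape, center_px, radius_mm, mm_per_px):
    """Boolean mask of a circular ROI; the radius in mm is converted to pixels."""
    yy, xx = np.indices(shape)
    cy, cx = center_px
    r_px = radius_mm / mm_per_px
    return (yy - cy) ** 2 + (xx - cx) ** 2 <= r_px ** 2


def roi_timecourse(movie, mask):
    """Mean reflectance change inside `mask` for every frame; movie is (frames, h, w)."""
    return movie[:, mask].mean(axis=1)


# Synthetic movie: 60 frames of a 100 x 100 pixel field (hypothetical 0.05 mm/pixel).
shape, mm_per_px = (100, 100), 0.05
frames = np.arange(60)
dip = -np.exp(-0.5 * ((frames - 40) / 5.0) ** 2)  # negative peak at frame 40

small = circular_roi(shape, center_px=(30, 30), radius_mm=0.4, mm_per_px=mm_per_px)
large = np.ones(shape, dtype=bool)  # stand-in for a whole forelimb representation

movie = np.zeros((frames.size,) + shape)
movie[:, small] = dip[:, None]  # focal activity confined to the small ROI

print(roi_timecourse(movie, small).min())  # -1.0: full-depth negative peak
print(roi_timecourse(movie, large).min())  # about -0.02: peak washed out
```

      Because the large ROI averages the focal dip with many inactive pixels, its peak shrinks roughly in proportion to the active area fraction, which is the signature the response uses to argue for focal activity.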

      Separately, we revised our thresholding approach to make those results less sensitive to thresholding effects (more details in our response to the first major point from Reviewer 1). The revised results – thresholded/binarized maps – are consistent with focal cortical activity. Fig 5G & 5J show activity maps thresholded (t-test, p<0.0001) without correction for multiple comparisons, and therefore represent the least restrictive estimate of the spatial extent of cortical activity. Measurements from these maps showed that significantly active pixels overlapped <40% of the M1 & PMd forelimb representations. We interpret the thresholded results as evidence in support of focal cortical activity.

      This raises the question of whether the 'granularity' claimed by the authors is

      b. dependent on the time-point one assesses the maps. Given the sluggish hemodynamic responses, it is unclear which part of the ISOI maps conveys the most information relative to the cue and arm/hand movements. I suspect that timepoints > 6 s will reveal even larger 'homogeneous' activations compared to the maps < 6s.

      We agree with the reviewer that the lag in hemodynamic signals complicates frame selection. Nevertheless, it is unlikely that cortical activity maps would have been larger at time points >6s from Cue. We provide three supporting arguments.

      (1) In the imaging sessions used in Fig 4, we acquired images for 9s per trial and systematically varied Cue onset time. The time courses in Fig 4A-B show that for all Cue onset conditions, the negative peak occurred <6s from Cue. This observation from unthresholded results does not support the notion of greater cortical activity at time points >6s from Cue.

      (2) From the same experiment, Fig 4C shows 9 thresholded/binarized maps generated from different time points in relation to Cue. We measured the size of each map (i.e., overlap with the M1/PMd forelimb representations). We present the results in Author response image 1. The largest maps came from an average frame captured +5.8-6.0s from Cue. Those maps are on the diagonal in Fig 4E (top left to bottom right). This result from thresholded data therefore does not support the notion of greater cortical activity at time points >6s from Cue.

      Author response image 1.

      (3) In all other sessions, we acquired images for 7s per trial (-1.0 to +6.0 s from Cue) without varying Cue onset time. At every time point (100 ms steps), we measured the size of the thresholded/binarized map in relation to the size of the M1 and PMd forelimb representations. The results are presented in Fig 5I & 5L and indicate that thresholded maps plateau in size by 5.0-5.5 s from Cue. At peak size, the maps overlapped <50% of the M1 and PMd forelimb representations. These results indicate that it is unlikely that we underreported the size of activity maps by not measuring map size beyond 6s from Cue.

      In fact, Fig 5F (which is highly thresholded) shows a surprisingly good match between the different forelimb actions, which argues against the existence of small subzones that are preferentially activated by different types of forelimb actions (the main claim of the authors).

      Our original proposal should have been more clearly stated. We were proposing that the thresholded maps, which had similar spatial organizations across conditions as the reviewer suggested, reported on subzones tuned for reach-to-grasp actions. Adjacent to those subzones could be other subzones that are preferentially active during other types of forelimb actions (e.g., pulling, pushing, grooming). We could not test this possibility in our study because the behavioral task examined a narrow range of arm and hand actions. We therefore revised the Discussion to state the limitations of our task and to lean more on published work that supports the present proposal (lines 439-443 and 504-508).

      2) Related to the previous point, the ROI selections/definitions for the time course analyses seem highly arbitrary. As indicated in the introduction, the clustering hypothesis dictates that "an arm function would be concentrated in subzones of the motor arm zones. Neural activity in adjacent subzones would be tuned for other arm functions." To test this hypothesis directly in a straightforward manner, the authors could use the results from the ICM experiment to construct independent ROIs and to evaluate the ISOI responses for the different actions. In that case, the authors could do a straightforward ANOVA (if the data permits parametric analyses) with ROI, action, and time point (and possibly subject) as factors.

      We agree with the reviewer, and we now leverage the ICMS map for guiding ROI placement. All time courses are now derived from 1 of 2 types of ROIs. (1) Small ROIs (0.4 mm radius) placed in zones defined from ICMS (e.g., M1 hand zone). (2) Large ROIs that include the entire forelimb representations in M1 or in PMd (Fig 5B).

    1. Author Response

      Reviewer #1 (Public Review):

      The authors interrogated an underexplored feature of CRISPR arrays to enhance multiplexed genome engineering with the CRISPR nuclease Cas12a. Multiplexing represents one of the many desirable features of CRISPR technologies, and use of highly compact CRISPR arrays from CRISPR-Cas systems allows targeting of many sites at one time. Recent work has shown though that the composition of the array can have a major impact on the performance of individual guide RNAs encoded within the array, providing ample opportunities for further improvements. In this manuscript, the authors found that the region within the repeat lost through processing, what they term the separator, can have a major impact on targeting performance. The effect was specifically tied to upstream guide sequences with high GC content. Introducing synthetic separator sequences shorter than their natural counterparts but exhibiting similarly low GC content boosted targeted activation of a reporter in human cells. Applying one synthetic separator to a seven-guide array targeting chromosomal genes led to consistent though more modest targeted activation. These findings introduce a distinct design consideration for CRISPR arrays that can further enhance the efficacy of multiplexed applications. The findings also suggest a selective pressure potentially influencing the repeat sequence in natural CRISPR arrays.

      Strengths:

      The portion of the repeat discarded through processing has normally been either included or omitted when generating a CRISPR-Cas12a array. The authors clearly show that something in between, namely a short version with similarly low GC content, can enhance targeting over the truncated version. A coinciding surprising result was that the natural separator completely eliminated any measurable activation, necessitating the synthetic separator.

      The manuscript provides a clear progression from identifying a feature of the upstream sequences impacting targeting to gaining insights from natural CRISPR-Cas12a systems to applying the insights to enhance array performance.

      With further support, the use of synthetic separators could be widely adopted across the many applications of CRISPR-Cas12a arrays.

      Weaknesses:

      The terminology used to describe the different parts of the CRISPR array could better align with those in the CRISPR biology field. For one, crRNAs (abbreviated from CRISPR RNAs) should reflect the final processed form of the guide RNA, whereas guide RNAs (gRNAs) captures both pre-processed and post-processed forms. Also, "spacers" should reflect the natural spacers acquired by the CRISPR-Cas system, whereas "guides" better capture the final sequence in the gRNA used for DNA target recognition.

      We thank the reviewer for this correction. We have now changed most uses of “crRNA” to “gRNA”. We decided to retain the use of the word “spacer” for the target recognition portion of the gRNA rather than changing it to “guide” as the reviewer suggests, because we think there is a risk that the reader would confuse “guide” with the non-synonymous “guide-RNA”. We have added a remark explaining our use of “spacer” (“A gRNA consists of a repeat region, which is often identical for all gRNAs in the array, and a spacer (here used synonymously with “guide region”)”).

      A running argument of the work is that the separator specifically evolved to buffer adjacent crRNAs. However, this argument overlooks two key aspects of natural CRISPR arrays. First, the spacer (~30 nts) is normally much longer than the guide used in this work (20 nts), already providing the buffer described by the authors. This spacer also undergoes trimming to form the mature crRNA.

      If we understand this comment correctly, the argument is that, in contrast to a ~20-nt spacer, a 30-nt spacer would provide a buffer between adjacent guides even if a separator is not present. However, even a 30-nt spacer may have high GC content and form secondary structures that would interfere with processing of the subsequent gRNA. Our hypothesis is that the separator is AT-rich and so insulates gRNAs from one another regardless of the length or GC composition of spacers. Please let us know if we have misunderstood this comment.
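      The design rule discussed here (insulating each gRNA from a possibly GC-rich upstream spacer with an AT-rich separator) can be sketched as plain string assembly. The repeat sequence is deliberately left as a caller-supplied parameter, the 0.6 GC cutoff and the example spacers are hypothetical, and only the AAAT synSeparator sequence is taken from the authors' reply.

```python
def gc_content(seq):
    """Fraction of G or C bases in a DNA/RNA sequence."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def gc_rich_upstream(spacers, cutoff=0.6):
    """Indices of spacers that sit upstream of another gRNA and exceed the
    (hypothetical) GC cutoff; the last spacer has no downstream gRNA."""
    return [i for i, s in enumerate(spacers[:-1]) if gc_content(s) > cutoff]


def build_array(spacers, repeat, separator="AAAT"):
    """Assemble a Cas12a-style array as repeated [separator][repeat][spacer] units.

    `repeat` is the processed (mature) repeat supplied by the caller. The
    separator stands in for the 5' portion of the natural repeat that is
    removed during processing, so one separator precedes every repeat."""
    return "".join(separator + repeat + spacer for spacer in spacers)


# Hypothetical 20-nt spacers; "REPEAT" is a placeholder, not a real Cas12a repeat.
spacers = ["GCGGCCGCGGCGCCGGCGGC", "ATTATAACGTTATAATATCA", "ACGTACGTTAGCATGCATGA"]
print(gc_rich_upstream(spacers))  # [0]
array = build_array(spacers, repeat="REPEAT")
```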

      Second, the repeat length is normally fixed as a consequence of the mechanisms of spacer acquisition. At most, the beginning of each repeat sequence may have evolved to reduce folding interactions without changing the repeat length, although some of these repeats are predicted to fold into small hairpins.

      We agree with this comment. Indeed, we propose that the separator, which is part of the repeat sequence, has evolved to reduce folding interactions. We now clarify this at the end of the Results section: “Taken together, the results from our study suggest that the CRISPR-separator has evolved as an integral part of the repeat region that likely insulates gRNAs from the disrupting effects of varying GC content in upstream spacers.”

      Prior literature has highlighted the importance of a folded hairpin with an upstream pseudoknot within the repeat (Yamano Cell 2016), where disrupting this structure compromises DNA targeting by Cas12a (Liao Nat Commun 2019, Creutzburg NAR 2020). This structure is likely central to the authors' findings and needs to be incorporated into the analyses.

      We thank the reviewer for this important insight. We have now performed experiments exploring the involvement of the pseudoknot in the disruptive effects of high-GC spacers.

      First, we used our 2-gRNA CRISPR array design (Fig. 1D), where the second gRNA targets the GFP promoter and the first gRNA contains a non-targeting dummy spacer. We generated several versions of this array in which we iteratively introduced targeted point mutations in the dummy spacer to form either a hairpin restricted to the dummy spacer or a hairpin that would compete with the pseudoknot in the GFP-gRNA’s repeat region (new Fig. S3). We found that both of these modifications significantly reduced performance of the GFP-targeting gRNA. These results suggest that interfering with the pseudoknot indeed disrupts gRNA performance, but also that hairpins that presumably do not interfere directly with the pseudoknot are detrimental, perhaps by sterically hindering Cas12a from accessing its cleavage site. Interestingly, the AAAT synSeparator largely rescued performance of the worst-performing of these constructs. These results are displayed in the new Fig. S3 and discussed in the related part of the Results section.

      Second, we have now performed a computational analysis using RNAfold in which we correlated the performance of all dummy spacers with their predicted secondary structure (Fig. 1M). The correlation between predicted RNA structure and array performance was higher when the structural prediction included both the dummy spacer and the entire GFP-targeting gRNA (R² = 0.57) than when it included only the dummy spacer (R² = 0.27; new figure panel S1C). This higher correlation suggests that secondary structures that involve the GFP-targeting gRNA play a more important role in our experiment than secondary structures that only involve the dummy spacer. These results are described in the Results section and in the Fig. 1 legend.
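      The correlation analysis in the preceding paragraph can be illustrated with a simple least-squares R². This sketch assumes, as a labeled simplification, that each construct's predicted secondary structure is summarized as a single minimum free energy (MFE) value (for example, as reported by RNAfold) and that performance is the percentage of GFP+ cells; the function name and all numbers below are hypothetical placeholders, not data or code from this study.

```python
def r_squared(x, y):
    """Coefficient of determination for an ordinary least-squares fit y ~ a + b*x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must both vary")
    # For a simple linear fit, R^2 equals the squared Pearson correlation.
    return (sxy * sxy) / (sxx * syy)


# Hypothetical placeholder values (NOT measurements from this study):
# predicted MFE (kcal/mol) for two folding windows, and % GFP+ cells per construct.
mfe_spacer_only = [-2.1, -5.4, -8.0, -3.3, -6.7]
mfe_spacer_plus_gfp_grna = [-10.2, -15.8, -21.5, -12.0, -18.9]
gfp_positive = [38.0, 22.5, 9.1, 30.2, 15.4]

print("spacer only:         R^2 =", round(r_squared(mfe_spacer_only, gfp_positive), 2))
print("spacer + GFP gRNA:   R^2 =", round(r_squared(mfe_spacer_plus_gfp_grna, gfp_positive), 2))
```

      Comparing the two R² values from the same performance measurements is what distinguishes a folding window that includes the downstream gRNA from one restricted to the dummy spacer.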

      Third, we now also performed secondary structure analysis (RNAfold) of two of our worst-performing dummy spacers (50% and 70% GC), which indicated that these spacers are likely to form secondary structures that involve both the repeat and spacer of the downstream GFP-targeting gRNA (Fig. 3G-H). Interestingly, this analysis suggested that the AAAT synSeparator improves performance of these spacers by loosening up these secondary structures or creating an unstructured bulge at the Cas12a cleavage site. These results are presented in Fig. 3G-H and the accompanying portion of the Results section.

      To conclude, our analyses suggest that secondary structure in the spacer and its interference with the pseudoknot in the repeat hairpin play a role in gRNA performance, and that inclusion of the AAAT synSeparator can partly rescue this performance, likely by restoring Cas12a access to the gRNA cleavage site.

      Many claims could better reflect the cited literature. For instance, Creutzburg et al. showed that adding secondary structures to the guide to promote folding of the repeat hairpin enhanced rather than interfered with targeting.

      We thank the reviewer for this comment. Creutzburg et al. report the interesting finding that a carefully designed 3’ extension of the spacer can counteract secondary structures that disrupt the repeat. In this way, the extension rescues disruptive secondary structures that involve the repeat and any upstream sequence. Relevant to this finding, it is conceivable that the synSeparator (AAAT) exerts its beneficial effect at the 3’ end of the GFP spacer by folding back onto the GFP spacer and in this way blocking secondary structures caused by a GC-rich dummy spacer located upstream of the GFP gRNA, according to the mechanism reported by Creutzburg et al. However, we used structural prediction of the GFP-targeting gRNA with and without the AAAT synSeparator and did not find evidence that the AAAT extension would cause this spacer to fold back onto itself (data not shown). Moreover, our experimental data (Fig. 3E) demonstrate that the synSeparator exerts its main beneficial effect when located upstream of the GFP-targeting gRNA, which would not be the case if the main mechanism was the one demonstrated by Creutzburg et al. We already had a paragraph discussing the Creutzburg paper in the Discussion, but we have now added a sentence specifying the mechanism that Creutzburg et al. demonstrated: “RNA secondary structure prediction (RNAfold) did not indicate that the GFP-targeting spacer would fold back on itself when an AAAT extension is added to the 3’ end, which would have been the case for the mechanism demonstrated by Creutzburg et al. (data not shown).”

      Liu et al. NAR 2019 further showed that the pre-processed repeat actually enhanced rather than reduced performance compared to the processed repeat.

      The experiment referenced by the reviewer (Fig. 2 in Liu et al., Nucleic Acids Research, 2019) in fact nicely supports our findings. In Liu et al., the pre-processed repeat only shows improved performance if it is located upstream of the targeting gRNA, and the gRNA is not followed by an additional pre-processed repeat (DRf-crRNA in their Fig. 2B & C). In this situation, the pre-processed repeat (containing the natural separator) may serve to enhance gRNA processing, as would be expected based on our results. At the same time, the absence of a full-length repeat downstream of the gRNA means that after gRNA processing, there will not remain any piece of RNA attached to the 3’ end of the spacer, which might disrupt gRNA performance. In contrast, when Liu et al. added an additional pre-processed repeat downstream of their gRNA (DRf-crRNA-DRf in the same panel), this construct performed the worst of all tested variants. This is consistent with our conclusion that the full-length separator reduces performance of gRNAs if it remains attached to the 3’ end of spacers. We have added a paragraph in the Discussion about this (Line 376).

      Finally, the complete loss of targeting with the unprocessed repeat appears to represent an extreme example, given multiple studies that showed effective targeting with this repeat (e.g. Liu NAR 2019, Zetsche Nat Biotechnol 2016).

      We acknowledge that our CRISPR array containing the full, natural separator (Fig. 3B) appears to be completely non-functional in contrast to the studies mentioned by the reviewer. We think this difference may have a few possible explanations. First, this array is in fact not entirely non-functional. Re-running the same experiment with a stronger dCas12a-activator (dCas12a-VPR, full length VPR, also used in Fig. 5) shows some modest GFP activation even with the full separator (1.4% vs 20.8% GFP+ cells; see the Appendix Figure 1). But for consistency, we have used the same, slightly less effective, dCas12a-activator (dCas12a-miniVPR) for all GFP-targeting experiments. Second, both the Liu et al. and Zetsche et al. studies used CRISPR editing rather than CRISPRa. We speculate that this might explain their relatively high indel frequency: Only a single cleavage event needs to take place for an indel to occur, whereas gene activation presumably requires the dCas12a-activator to be present on the promoter for extended periods of time. Thus, any inefficiency in DNA binding caused by the separator remaining attached to the spacer might disfavor CRISPRa activity more than CRISPR-editing activity. We have added these considerations to the Discussion and referenced the suggested papers (Line 376).

      Appendix Figure 1: Percentage of GFP+ cells without or with a full-length separator using dCas12a-VPR (full length) gene activation.

      Relating to the above point, the vast majority of the results relied on a single guide sequence targeting GFP. While the seven-guide CRISPR array did involve other sequences, only the same GFP targeting guide yielded strong gene activation. Therefore, the generalizability of the conclusions remains unclear.

      We have now performed several experiments that address the generalizability of our conclusions:

      First, we now include data demonstrating that the beneficial effect of adding a synSeparator is not limited to the AAAT sequence derived from the Lachnospiraceae bacterium separator. We now include three other 4-nt, AT-rich synSeparators derived from Acidaminococcus s. (TTTT), Moraxella b. (TTTA) and Prevotella d. (ATTT) (Fig. 3I). All these synSeparators rescued the poor GFP activation caused by an upstream spacer with high GC content, though not equally effectively. The quantitative differences between the synSeparators could be due to the intrinsic “insulation capacity” of these sequences, to the way they interact with the LbCas12a protein, or to sequence-specific interactions with this particular CRISPR array. We discuss these possibilities in the Discussion (Line 437).

      Second, we now include data demonstrating that nuclease-deactivated, enhanced Cas12a from Acidaminococcus species (enAsdCas12a; Kleinstiver et al., 2019) is also sensitive to the effects of high-GC spacers (Fig. 3J). This poor performance was largely rescued by including a TTTT synSeparator derived from the natural AsCas12a separator.

      Furthermore, we have now included a paragraph in the Discussion where we speculate on why the effect of adding the synSeparator was more modest for the endogenous genes than for GFP: 1) Our GFP-expressing cell line has multiple GFP insertions in its genome, and each copy has seven protospacers in its promoter. This may amplify the effect of the synSeparator. 2) The gRNAs used for endogenous activation were taken from the literature or had been pre-tested by us. These guides had thus already proven to be successful and might not be particularly disruptive (e.g., they were not selected by us for having high GC content). Therefore, researchers might experience the greatest benefit from the synSeparator with newly designed spacers that have not already proven to be effective even without the synSeparator.

      Reviewer #3 (Public Review):

      Magnusson et al. do an excellent job of defining how the repeated separator sequence of wild-type Cas12a CRISPR arrays impacts the relative efficacy of downstream crRNAs in engineered delivery systems. High GC content, particularly near the 3' end of the separator sequence, appears to be critically important for the processing of a downstream crRNA. The authors demonstrated that naturally occurring separators from 3 Cas12a species also display reduced GC content. The authors use this important new information to construct a synthetic small separator DNA sequence which can enhance CRISPR/Cas12a-based gene regulation in human cells. The manuscript will be a great resource for the synthetic biology field as it shows an optimization to a tool that will enable improved multi-gene transcriptional regulation.

      Strengths:

      • The authors do an excellent job in citing appropriate references to support the rationale behind their hypotheses.
      • The experiments and results support the authors' conclusions (e.g., showing the relationship between secondary structure and GC content in the spacers).
      • The controls used for the experiments were appropriate (e.g., using full-length natural separator vs single G or 1 to 4 A/T nucleotides as synthetic separators).
      • The manuscript does a great job assessing several reasons why the synthetic separator might work in the discussion section, cites the relevant literature on what has been done and restates their results to argue for or against these reasons.
      • This paper will be very useful for research groups in the genome editing and synthetic biology fields. The data presented (especially the data concerning the activation of several genes) can be used as a comparison point for other labs comparing different CRISPR-based transcriptional regulators and the spacers used for targeting.
      • This paper also provides optimization to a tool that will be useful for regulating several endogenous genes at once in human cells thus helping researchers studying pathways or other functional relationships between several genes.

      Opportunities for Improvement:

      • The authors have performed all the experiments using LbCas12a as a model and have conclusively proven that the synSeparator enhances the performance of Cas12a-based gene activation. Will this phenomenon be the same for other Cas12a proteins (such as AsCas12a)? The authors should perform some experiments to test the universality of the concept. Ideally, this would be done in HEK293T cells and one other human cell type.

      We thank the reviewer for these suggestions. We have now addressed the generalizability of our findings with several new experiments. First, we now include data demonstrating that nuclease-deactivated, enhanced Cas12a from Acidaminococcus species (denAsCas12a; Kleinstiver et al., 2019) is also sensitive to the effects of high-GC spacers (Fig. 3J). This poor performance was largely rescued by including a TTTT synSeparator derived from the natural AsCas12a separator.

      Second, we now include data demonstrating that the beneficial effect of adding a synSeparator is not limited to the AAAT sequence derived from the Lachnospiraceae b. separator. We now include three other 4-nt, AT-rich synSeparators derived from Acidaminococcus s. (TTTT), Moraxella b. (TTTA) and Prevotella d. (ATTT) (Fig. 3I). All these synSeparators rescued the poor GFP activation caused by an upstream spacer with high GC content, though not equally effectively. The quantitative differences between the synSeparators could be due to the intrinsic “insulation capacity” of these sequences, to the way they interact with the LbCas12a protein, or to sequence-specific interactions with this particular CRISPR array. We discuss these possibilities in the Discussion.

      Third, as described above, we have now performed an in vitro Cas12a cleavage assay and present the data in a new figure (Fig. 4). We found that a CRISPR array containing a 70%-GC dummy spacer was processed less efficiently than an array containing a 30%-GC spacer, but that addition of a synSeparator could to a large extent rescue this processing defect (Fig. 4E). The fact that this result was observed even in a cell-free in vitro setting suggests that it is a general feature of Cas12a CRISPR arrays that is likely to work the same way in many cell types rather than being specific to HEK293T cells.

      Fourth, we attempted to investigate the effect of the synSeparator in different cell types. However, either due to poor transfection efficiency or poor expression of the Cas12a activator construct, CRISPRa activity was consistently poor in these cell types, both with and without the synSeparator (e.g., we did not visually observe fluorescence from the mCherry gene fused to the dCas12a activator, which we always see in HEK293T cells). Because of the low general efficiency of CRISPRa, it was not possible to evaluate the performance of the synSeparator. Many cell types are difficult to transfect and dCas12a-VPR-mCherry is a big construct (>6 kb). To our knowledge, there have not been many reports using dCas12a-VPR in cell types other than HEK293T. While we think that it will be important to optimize CRISPRa in many cell types (e.g., by optimizing transfection conditions, Cas12a variants, promoters, expression vectors, etc.), the focus of our study has been to show the separator’s mechanism and general function; we believe that optimizing general CRISPRa for different cell types is beyond the scope of this paper. We acknowledge that this is a limitation of our study and we have added a paragraph about this in the Discussion (line 355). We nevertheless hypothesize that the negative influence of high-GC spacers and the insulating effect of synSeparators are generalizable across cell types. That is because we could observe improved array processing with the synSeparator even in the cell-free context of an in vitro expression system, as described above (Fig. 4). This suggests that the sensitivity to spacer GC content is determined only by the interaction between Cas12a and the array, rather than being dependent on a particular cellular context.

    1. Author Response

      Reviewer #1 (Public Review):

      However, the authors are cautioned to tone down some of the sentences about the human diabetic samples, as they rely heavily on extrapolation rather than experimental tests.

      Thank you for this feedback. We have added an experimental test to support the CellChat results. In accordance with the CellChat analysis, immunofluorescence (IF) staining showed more macrophage Gas6 expression in diabetic wounds. These data are now included in Figures 3C-D. We have additionally edited the text relating to Figure 3 to indicate that these results are not fully conclusive.

      For instance, the antibody inhibition of Axl had minimal effect on the clearance of apoptotic cells in the wound and this would be expected with the redundancy endowed by other TAM receptors.

      Thank you for this point. We have made a note of this in the text in lines 289-291.

      For instance, in Figure 6, the number of TUNEL+ cells seems to be higher in the IgG samples compared to the anti-Timd4 treatment, but this is not the case in the quantification.

      Thank you for this comment. We have replaced these with more representative images, which appear in Figure 6A. We also repeated the staining with antibodies for cleaved caspase 3, which showed similar results; these data appear in Fig. 6 – Fig. supplement 1A.

      Reviewer #2 (Public Review):

      I suggest repeating the quantification of cells containing active caspase-3 with an anti-cleaved caspase-3 antibody. Here the authors use an antibody recognizing phospho-S150, which is far from generally accepted as a marker for active caspase-3. It would also be good to quantify the apoptotic cells observed in the sections (Fig 1 I and J) and compare them to control treatment on sections. It is not clear from the data presented whether the number of apoptotic cells increases or not in the time frame analyzed, since the controls are lacking.

      Thank you for this important suggestion. We have repeated the IF staining using an antibody for cleaved caspase 3 (Cell Signaling 9661S) and quantified the apoptotic cells present. We found that apoptotic cells were rare but present at both 24h and 48h after injury, and that significantly more cleaved caspase 3+ cells were present in 48h wounds than 24h wounds. These data are now included in Figure 1H-J and Fig. 1 – Fig. supplement 1F. We have also used this antibody in IF staining in Fig. 5 – Fig. supplement 1B and Fig. 6 – Fig. supplement 1A.

      In a FACS analysis (Fig S1 H), the authors show that there is no increase in dead cells in a time frame of 48 hrs. Could it be that the majority of the cells that may have died in vivo were lost during the tissue digestion procedure? Dead cells tend to aggregate.

      Based on these comments and the inconsistency in these data due to potential technical challenges, we have removed the FACS data quantifying Annexin V. We now include the quantification of cleaved caspase 3 and an efferocytosis assay to analyze the kinetics of efferocytosis.

      On line 104 the authors refer to the apoptosis-inducing activity of G0s2. Please note that there is little or no in vivo evidence for a role of G0s2 in apoptosis.

      Thank you for this helpful comment. We have removed this gene from our analysis and text.

      The authors state that Axl is uniquely expressed in DC and fibroblasts (Fig 2). Are the Axl+ cells in panel G (red, Fig 2) that do not stain for the Pdgfra marker (green) then all DCs? Please clarify or show with a triple staining that these cells are indeed DCs.

      Thank you for this comment. To clarify, our intention was to show that both DCs and fibroblasts express Axl, not to say conclusively that only DCs and fibroblasts express Axl. Indeed, in Figure 5, we show that a portion of macrophages also express Axl (at day 3), so some of the Axl+ cells in 2G may be macrophages rather than DCs. We have made this clearer in the text in lines 163-166.

      In addition, it is not clear to me to what reference level exactly the expression levels are compared in Fig 2A. Is this between the 24 and 48h time points after wounding (as mentioned in the legend)? If so, the analysis may indicate up or down regulation but not necessarily expression or no expression.

      Thank you for making this point. The heatmaps display scaled log-normalized mRNA counts for the entire dataset, not a comparison between the two timepoints. We have clarified this in the figure legends.

      2) Human diabetic wounds display increased and altered efferocytosis signaling via Axl. This conclusion is solely based on CellChat analysis and should be toned down or validated.

      Thank you for this suggestion. We have experimentally validated this conclusion using IF staining for Gas6. We found more Gas6 staining in CD68+ macrophages in diabetic foot ulcers than in nondiabetic foot wounds. These data are now included in Figure 3C-D.

      The authors conclude that anti-Axl treatment leads to healing defects based on lack of granulation tissue and larger scabs, a reduction of fibroblast repopulation and revascularization. The differences in the last two parameters mentioned above are obvious, however the other parameters, as granulation tissue and scabs are less clear to me. Is this quantified in any way? In Fig S4 D there is also a large scab visible in the control treatment image. Therefore, it would be good if these parameters could be better substantiated.

      Thank you for this comment. We have edited the text in lines 301-304 to de-emphasize these qualitative changes.

      In view of the lack of revascularization, are there differences in the mRNA expression levels of angiogenic factors such as VEGF and others at this time point? Does revascularization occur at later stages?

      Thank you for this helpful suggestion. We have used qPCR to measure Vegfa mRNA expression, and these data are now included in Figure 5I. We found no significant difference in Vegfa expression 5 days after injury.

      Based on the FACS analysis the authors claim that there are no differences at the level of DCs. However, the plots shown in Fig 5C do not convincingly show the detection of DC (as boxed in the lower panel). Based on the density plots one would presume this is just the continuation of the CD11b+ population and not a separate CD11c+ population. To get a better view of that, it would be better to show dot plots instead of density plots.

      Thank you for this insightful comment. We have created new plots as suggested to demonstrate that this is not the case. In the wound bed, unlike in blood isolates, full separation of populations is often elusive; to address this, we use single-stain controls to set the gates. Author response image 1 shows the same data as dot plots, as requested, alongside the single-stain control, demonstrating that the gating strategy is adequate. We acknowledge that in dissociated tissues the population outlines are sometimes less well defined than those obtained in immunological samples.

      Author response image 1.

      Finally, the authors state (line 265-266) that anti-Axl treatment leads to non-significantly increased expression of IL1alpha and IL6 after one day of injury (Fig S4C). If the difference between the control-treated and the anti-Axl-treated group is statistically not significant I would not conclude there is an increase. Please adapt phrasing or include more mice in the experiment (now only 4) to substantiate the observation and clarify whether it is increased or not.

      Thank you for this comment. We have altered the text in lines 286-289 to better reflect this.

      The authors conclude that overall healing was not affected but that the wound beds appeared more fragile. What is meant by 'appeared more fragile' is not clear. In addition, this seems to me a quite subjective interpretation. What are the objective parameters used to come to this conclusion?

      Thank you for this point. We have altered the text to remove this subjective language.

      Similar to inhibition of Axl, inhibition of Timd4 led to a defect in revascularization as witnessed by the absence of CD31 staining. Also in this experiment one can raise similar questions as in the anti-Axl experiment: 1) does revascularization occur at a later timepoint; 2) what about the expression of angiogenic factors?

      Thank you for this helpful suggestion. To further investigate the impact of Axl inhibition on angiogenesis, we have assayed for Vegfa by qPCR. We found no significant difference in Vegfa expression 5 days after injury. These data are now included in Figure 5I.

      In the anti-Timd4 treated wounds the authors observe more TUNEL-positive cells and conclude that this is due to a defect in efferocytosis. However, the formal experimental proof for this in the current model is lacking. How do the authors exclude the possibility that anti-Timd4 treatment attracts more infiltrating cells that then undergo apoptosis, or that the treatment with anti-Timd4 leads to more apoptosis of certain cells in the wound bed? What is the nature of these apoptotic cells (neutrophils, T cells, others)? It has been shown that Timd4 can have stimulatory effects on other cells, such as T cells. Could deprivation of Timd4 signaling in certain conditions lead to more dying cells in this model?

      Thank you for this insightful comment. To investigate this, we have repeated this experiment with IF staining for cleaved caspase 3 and found similar results, confirming the increase in apoptotic cells upon Timd4 inhibition (Fig. 6 – Fig. supplement 1A). We have also included text to acknowledge the possibility of an increase in apoptosis in lines 326-327.

      Reviewer #3 (Public Review):

      They never show that there is an increase in apoptotic cells in the wounds that then decreases (which would be a sign that the cells are being cleared via efferocytosis). In addition, they are looking for apoptotic cells at very early time points (24-48 hours), times at which large numbers of apoptotic cells would not be expected. As an example, neutrophil infiltration peaks at 24-48 hours and efferocytosis of apoptotic neutrophils would be expected after that. Other types of apoptotic cells would likely be cleared even later. Finally, several of the panels showing apoptotic cells were done with a very small number of samples (1-3 per group), so it is unclear how rigorous the data are. I would recommend that the authors at the very least soften the wording related to these conclusions and discuss the limitations of their experimental design; ideally, data from more samples would be included to provide clear support for those statements.

      Thank you for raising this important point. In order to support these claims, we have undertaken two additional experiments. First, we have repeated the immunofluorescence staining with a new antibody for activated caspase 3 and quantified the number of apoptotic cells present in 24h and 48h wound beds. We found that apoptotic cells significantly increased in 48h wound beds compared to 24h wounds (Figure 1H and Fig. 1 – Fig. supplement 1F).

      We have also undertaken a new experiment to show the temporal regulation of efferocytosis. We injected stained apoptotic neutrophils into 1D, 3D, and 5D wound beds and quantified the stained cells remaining after 1 hour in order to quantify the clearance of cells from the wound bed at different timepoints. We found that significantly more stained cells undergoing efferocytosis remained in 5D wounds, and that the rate of efferocytosis was approximately constant over this timeline. These data are now included in Figures 2H-M.

      While we would be interested to determine the identities of cells engaging in efferocytosis of the labeled apoptotic neutrophils, we found that co-staining for additional cell markers was impossible while maintaining the fluorescent labeling on the injected neutrophils.

      2) The human RNA-seq data is also quite limited, as non-diabetic wound tissue was all from one patient. Again, this limitation should be acknowledged.

      Thank you for this feedback. We have analyzed new data sets that include 5 individuals with diabetic foot ulcers and 4 individuals with non-diabetic wounds. These data are now included in Figure 3.

      Also, there are some important published papers by Sashwati Roy's group indicating that there are defects in efferocytosis in diabetic wounds, which may go against what the authors are showing here to some degree. The authors' work should be discussed in relation to these other studies.

      Thank you for this suggestion. We have included discussion of this work in the text in lines 192-193.

      3) For anti-Axl and anti-Timd4 experiments, the authors conclude that inhibition of Axl does not affect TUNEL+ cells and that Timd4 does not affect reepithelialization. However, in some cases the sample size was only 3 mice per group when measuring these parameters. That is a very small number of samples to draw conclusions about apoptotic cells or reepithelialization since these parameters are key for the overall conclusions of the experiments. Given that these are key data, it would be important to include more than n=3. Additionally, as stated above, a time point later than 24 h may be necessary to actually see changes in apoptotic cells.

      Thank you for this suggestion. We have repeated the staining for apoptotic cells using a new antibody for cleaved caspase 3 and stained wound beds from additional mice. In the anti-Axl experiments, we now show data for cleaved caspase 3 staining of 3- and 5-day wound beds with N=4 (Fig. 5 – Fig. supplement 1B). In the anti-Timd4 experiments, we now have N=6-11 for the TUNEL staining at 5 days after injection and injury (Figure 6B).

      4) In Fig 6, there look to be many more TUNEL+ cells in the wound bed of IgG control samples compared to anti-Timd4-treated samples, which contradicts the graph. Perhaps the authors could clarify where they were taking their measurements for panels with image analysis results.

      Thank you for this helpful point. We have updated this figure to be more representative of the quantification (Figure 6A-B), as well as repeated the staining with antibodies for cleaved caspase 3 (Fig. 6 – Fig. supplement 1A).

      Another question related to this experiment is how it is possible that efferocytosis is so drastically different yet there are no changes in wound healing (this is one reason why a larger sample size for reepithelialization may be critical) - this would seem to suggest that efferocytosis is not important in wound healing, which is confusing. Further discussion on this might be useful.

      Thank you for this point. Indeed, we see that there is a defect in revascularization when Timd4 is inhibited (Figure 6E-F), which indicates that efferocytosis is important to normal healing. This is discussed in lines 333-335.

    1. Author Response:

      Reviewer #1:

      In this ms, Voroslakos et al., describe a customizable and versatile microdrive and head cap system for silicon probe recordings in freely moving rodents (mice and rats). While there are similar designs elsewhere, the added value here is: a) a carefully designed solution to facilitate probe recovery, thus reducing experimental costs and favoring reproducibility; b) flexibility to accommodate several microdrives and additional instrumentation; c) open access design and documentation to favor customization and dissemination. Authors provide detailed description to faccilitate building the system.

      Personally, I found this resource very useful to democratize multi-site recordings, not only for standard silicon probes, but also for more novel integrated optoelectrodes and Neuropixels. While there are other solutions, this design is quite simple and versatile. A potential caveat is whether it could be perceived as just an upgrade, given some similarities with previous designs (e.g. Chung et al., Sci Rep 2017 doi: 10.1038/s41598-017-03340-5) and concepts (Headley et al., JNP doi: 10.1152/jn.00955.2014). However, the system presented in this paper provides added value and knowledge-based solutions to make silicon probe recordings more accessible.

      We thank the reviewer for carefully reading our manuscript and providing useful and constructive comments.

      Reviewer #2:

      This manuscript provides an updated guide on the procedures for performing chronic recordings with silicon probes in mice and rats in the lab of the senior author, who is one of the leaders in the use of this experimental method. The new set of procedures relies on metal and plastic 3D printed parts, and represents a major improvement over the older methodology (i.e. Vandecasteele et al. 2012).

      The manuscript is clearly written and the technical instructions (in the Methods section) seem rather detailed. The main concerns I had are as follows.

      We thank the reviewer for carefully reading our manuscript and providing useful and constructive comments.

      1) The present design is an improvement over Chung et al. (the most similar previously published explantable microdrive design, as far as I am aware) in terms of the footprint and travel distance. However, a main disadvantage of the system in its present form is that (apparently) it does not support Neuropixels probes. While such probes might not be suitable for some uses (e.g. to record from large populations in dorsal hippocampus), Neuropixels probes are of considerable interest to many labs.

      Our microdrive and head cap system can also support Neuropixels probes. Since our initial submission, we have implanted a Neuropixels probe in the intermediate hippocampus of a rat using our recoverable, plastic microdrive. At the end of the experiment, the Neuropixels probe was successfully recovered, cleaned, and implanted again in a new rat. In addition, we designed a new arm for our metal microdrive that can support Neuropixels probes (Figure 2) and implanted another rat (Figures 3 and 4). We have also created a video showing how to attach a Neuropixels probe to a metal microdrive (Suppl. Video 3).

      Figure 2. Metal microdrive adapter for Neuropixels probe. A Arm design for 64-channel silicon probes. 45°, front, side and top views are shown (from left to right). All dimensions are in mm. B Changing the overall length (from 7.35 mm to 10 mm) and width (from 4 mm to 5.4 mm) of the 64-channel arm makes our metal microdrive compatible with Neuropixels probes. Note that only three dimensions of the 64-channel arm were modified (red numbers). 45°, front, side and top views are shown (from left to right). All dimensions are in mm. C Photograph of the different arm designs of the metal, recoverable microdrive (top shows an arm designed for a 64-ch silicon probe, bottom shows an arm designed for a Neuropixels probe).

      Figure 3. Recording of unit firing with a Neuropixels probe attached to a metal microdrive in a freely moving rat. A Metal microdrive for Neuropixels probe (a – stereotax attachment, b – drive holder, c – metal microdrive, d – Neuropixels probe and e – Neuropixels headstage). B Photo of Neuropixels probe attached to a metal microdrive (a-e same as in A). C Location of probe implantation (bregma −4.8 mm, mediolateral +4.6 mm, 11° angle). D High-pass filtered traces (1 s) from a freely moving rat implanted with a Neuropixels probe. Note the single unit activity in the cellular layer of cortex (top) and hippocampus (bottom).

      Figure 4. Implantation of Neuropixels probe in a rat using metal microdrive and rat cap system. A The base of the rat cap is attached to the skull. Reference (ref) and ground (gnd) screws are placed over the cerebellum. The Neuropixels probe is mounted on a metal microdrive. The microdrive is held by the drive holder and attached to a stereotax arm using the stereotax attachment. For more details, see video: Neuropixels_attachment.mp4. B Once the probe is inserted to its final depth (left), the base of the microdrive is cemented to the skull (zoomed-in photograph on the right). C The surface of the brain is kept wet using saline during probe insertion and while cementing the base of the microdrive. D After the base is cemented, the craniotomy is sealed with bone wax. E Releasing the drive from the drive holder. Once it is released, the stereotax arm is moved upwards. F The Neuropixels headstage is removed from the male header of the stereotax attachment (soldering joint) and placed on the animal's back. G The walls of the cap system are attached to the base. Ground and reference wires are soldered to the probe (not shown). H The male header of the headstage is secured to the walls. The headstage and its cable are oriented to allow easy access to the screw head of the microdrive. Note that there is enough room for custom connectors inside the rat cap.

      2) The total weight of the mouse implant seems quite high (together with the headstage, I estimate it is >= 4 g). Could the authors provide the exact value, and describe whether this has any impact on the way the animal moves? Also, the authors should describe how the animals are housed (e.g. do they carry the headstage even when not being recorded). The authors say that a mouse can be implanted with more than one microdrive. The authors should clarify whether they actually have experience with such implants, or whether this is just a suggestion based on their educated estimate.

      The total weight of the metal microdrive, including the base, body and arm, is 0.87 g. Additional weight comes from the Metabond and dental acrylic cement. The amount of cement used during surgery can vary between researchers and the type of surgery. The overall weight of the assembly also depends on the silicon probe with Omnetics connector(s) used for the surgery, e.g.: a 32-channel micro-LED probe is 1.11 g (NeuroLight Technologies LTD.), a 64-channel 4-shank probe is 0.96 g (ASSY E-1, Cambridge NeuroTech), a 64-channel 5-shank probe is 1.05 g (A5x12-16-Buz-Lin-5mm, NeuroNexus Ltd.) and a 128-channel 4-shank probe with integrated Intan chips is 0.94 g (P128-5, Diagnostic Biochips). In addition, the overall weight of the entire assembly can change if optic fibers are used in optogenetic studies or if any custom connectors are implanted (e.g., connector and wires for brain stimulation). That is why we reported the overall weight of each system (metal microdrive, mouse cap and rat cap) individually.
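      The weights listed above can be combined into a lower bound on the drive-plus-probe mass for each probe type. The short sketch below uses only the values quoted in this response. It deliberately leaves out cement, cap, optic fibers and headstage, whose weights are not given here.

```python
# Weights (g) quoted in the response; cement, cap, fibers and headstage are NOT included.
DRIVE_G = 0.87  # metal microdrive: base, body and arm

PROBE_G = {
    "32-ch micro-LED (NeuroLight)": 1.11,
    "64-ch 4-shank (ASSY E-1, Cambridge NeuroTech)": 0.96,
    "64-ch 5-shank (A5x12-16-Buz-Lin-5mm, NeuroNexus)": 1.05,
    "128-ch 4-shank (P128-5, Diagnostic Biochips)": 0.94,
}


def drive_and_probe_g(probe_g: float) -> float:
    """Lower bound on implant weight: microdrive plus probe only, rounded to 0.01 g."""
    return round(DRIVE_G + probe_g, 2)


for name, grams in PROBE_G.items():
    print(f"{name}: {drive_and_probe_g(grams):.2f} g")
```

      On this basis, the microdrive and probe together weigh roughly 1.8 to 2.0 g before any cement or cap is added.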

      The implanted mice are single housed, and they do not carry the headstage while in the vivarium. During recordings, the headstage is attached and a counterbalanced pulley system ensures that the animal is not carrying the extra weight of the headstage. We have quantitatively compared running speed with traditional and the new head caps in both rats and mice (Fig. 6).

      The small footprint of the metal microdrive enables researchers to perform more than one silicon probe implantation in freely moving mice. For this purpose, larger mice (>35 g) are selected (Figure 5).

      Figure 5. Metal microdrive enables double silicon probe recordings in freely moving mice. A Intraoperative photograph of double silicon probe implantation. Note that the metal microdrive on the left had been secured to the skull and the second drive is being implanted using the stereotaxic attachment and drive holder. The probe PCBs are placed on the copper mesh. B Photograph focused on the metal microdrives.

      3) There is no information in the results section on the number of implants performed, the duration the animals were implanted, the quality of the recordings obtained, or the number of successes or failures. The figures merely provide examples of one successful recording in a mouse and in a rat. All these details should be provided, along with details of how many probes were reused and how many times (a brief mention of one case, lines 252-253 and 359-360, is not sufficient).

      We have added a Supplementary Table explaining all the details of our implants. We would like to refer the Reviewer to response #1 to Reviewer 1.

      Adopting new technology is challenging. To date, we have extensive experience with the rat cap system only (n = 3 users in the lab, n = 25 rats implanted). Two lab members have started to adopt our mouse cap and have implanted 3 mice since our submission. We included their maze running behavioral data for comparison between the copper mesh and cap system.

      Prior to the development of the metal microdrive, we conducted an internal lab survey comparing the hand-made microdrive (Vandecasteele et al., 2012) and our recoverable, plastic microdrive. Six lab members who had extensive experience with both types participated (Figure 6). Our questions were:

      1) On a scale of 1-10, how would you compare the plastic, recoverable drive to the Vandecasteele et al. 2012 one in terms of: a) ease of building a drive, b) size and c) ease of recovery.

      Figure 6. Internal lab survey using recoverable, plastic microdrives. A User feedback based on four criteria: ease of building, ease of implantation, size, ease of recovery. The 3D-printed microdrive surpasses the manually built drive (Vandecasteele et al., 2012) on every parameter except size. B 24 silicon probes were used with the recoverable plastic microdrive. On average, each probe was recovered two times. Of these 48 recovery attempts, only 5 failed. There were 2 total losses during recovery, and in three cases a varying number of shanks broke during the recovery process, so those recoveries were only partially successful. One major limitation of reusability is the sudden increase in impedance over time (we had to discard 30% of the successfully recovered probes for this reason). Researchers in our lab spend on average 30 minutes to recover a silicon probe.

      Overall, the success rate of recovery is much higher using a recoverable microdrive system, but the size of the plastic, recoverable microdrive limits certain experiments. This was one of the main motivations to develop the metal, recoverable microdrive.
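      The recovery counts in panel B can be expressed as rates with simple arithmetic. The sketch below uses only the counts stated in the legend (48 attempts, 2 total losses, 3 partial recoveries). It leaves out the 30% impedance-related discard, because the legend does not say whether that percentage refers to probes or to recoveries.

```python
# Counts taken from the Figure 6B legend.
attempts = 48            # 24 probes, each recovered two times on average
total_losses = 2         # probe lost entirely during recovery
partial_recoveries = 3   # one or more shanks broke during recovery

failed = total_losses + partial_recoveries
full_success_rate = (attempts - failed) / attempts
total_loss_rate = total_losses / attempts

print(f"full recoveries: {attempts - failed}/{attempts} = {full_success_rate:.1%}")
print(f"total losses:    {total_losses}/{attempts} = {total_loss_rate:.1%}")
```

      This corresponds to about 90% fully successful recoveries and about 4% total losses.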

      4) In fig. 2, spike waveforms are classified as pyramidal, wide or narrow interneurons. I did not find any description of how this classification was performed.

      We have removed the single cell putative cell types from the manuscript as this issue is not relevant to the current manuscript. Figure 2 has been simplified and a new figure 5 is dedicated to the single cell quantification.

      5) Also in fig. 2, refractory period violations are reported in percent (permille in fact). First, it is not clear how the refractory period was defined. Second, such quantification is incorrect in principle: we use refractory period violations to infer the rate of false positives. Yet the relationship between the fraction of ISI violations and the false positive rate depends on the firing rate of the neuron. For example, 0.1% of ISI violations is quite good for a unit spiking at 10 spikes/s, is so-so for a unit spiking at 1 spike/s, and is very bad if the firing rate is 0.1 spike/s (see Hill et al. J Neurosci 2011 for derivation). Alternatively, the authors can follow an approach described in an old paper by the same lab (Harris et al., J Neurophysiol 2000), quantifying the violations in the spike autocorrelogram relative to its asymptotic height.

      We have removed this panel from Fig. 2 and dedicated a new figure (Fig. 5) to the single cell quantification. Refractory violations can be used as an alarm for poor cluster quality. Absence of refractory violations alone does not guarantee good separation for the reasons the Reviewer mentioned.
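      The reviewer's point about firing rate can be made concrete with a back-of-the-envelope comparison. This is our own simplification, not the Hill et al. derivation and not the analysis used in the manuscript. For a homogeneous Poisson spike train at rate λ, interspike intervals are exponentially distributed, so the fraction of ISIs shorter than a refractory period τ is 1 − e^(−λτ). Dividing an observed violation fraction by this value shows how close a unit is to a completely random spike train. The 2 ms refractory period below is an assumption.

```python
import math


def poisson_isi_violation_fraction(rate_hz: float, refractory_s: float) -> float:
    """Fraction of ISIs shorter than refractory_s for a Poisson train at rate_hz.

    Poisson ISIs are exponential, so P(ISI < t) = 1 - exp(-rate * t).
    """
    return 1.0 - math.exp(-rate_hz * refractory_s)


def violation_ratio(observed_fraction: float, rate_hz: float,
                    refractory_s: float = 0.002) -> float:
    """Observed violation fraction relative to a fully random train at the same rate.

    Values near or above 1 mean the unit violates its refractory period about as
    often as (or more often than) random spikes would. The 2 ms default is assumed.
    """
    return observed_fraction / poisson_isi_violation_fraction(rate_hz, refractory_s)


# 0.1% ISI violations evaluated at the three firing rates from the comment.
for rate in (10.0, 1.0, 0.1):
    print(f"{rate:5.1f} spikes/s: ratio = {violation_ratio(0.001, rate):.2f}")
```

      With 0.1% violations, the ratio is about 0.05 at 10 spikes/s, 0.5 at 1 spike/s and 5 at 0.1 spike/s. At the lowest rate, the unit therefore violates its refractory period more often than a purely random train would, which matches the reviewer's ordering.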

      6) Line 477: the authors write that the probes were mounted on a plastic microdrive. This seems to contradict the key claim of the manuscript (namely that the microdrives were from stainless steel).

      We apologize if this description was not clear in the original manuscript. In the revised version, we have added a table (Suppl. Table 1) explaining all details of each animal subject (species, strain, weight, cap type), type of silicon probe and microdrive used. As we explained in Response 3, our main goal was to test each system individually and once all components have been verified, we combined everything into one surgery.

      The plastic and metal microdrives are based on the same principles. The implantation/recovery tools are also identical in design concepts. Based on our own experience, users do not notice any changes in terms of ease of use, ease of implantation and ease of recovery when changing from plastic recoverable microdrives to metal ones. The advantages of metal drives are their smaller size, their reusability over multiple implantations and their stability.

      7) I believe that the work of Luo & Bondy et al. (eLife 2020) should be referenced and compared with.

      We reference Luo et al. (2020) in our revised manuscript. One of the main advantages of using a microdrive system is the ability to move the recording probe inside the brain tissue and sample new sets of neurons. This is not the case in Luo & Bondy et al. (eLife 2020).

    1. Author Response:

      Reviewer #1 (Public Review):

      Summary

      The authors have discovered and characterized a novel genetic pathway responsive to hypoxia, which acts in parallel to the canonical response through activation of Hypoxia-Inducible Factor (HIF). Specifically, the authors discovered that the Caenorhabditis elegans nuclear hormone receptor NHR-49, ortholog to mammalian PPAR-alpha, is essential for survival under hypoxic conditions and regulates hif-1-independent target gene expression, and they identify an essential role for autophagy. Further, the authors discover both positive and negative regulators of NHR-49 and a putative feedback loop.

      Overall analysis

      The genetic analysis conducted by the authors is outstanding. However, the study is lacking in a few key areas and the authors may have over-interpreted results in a few places, which diminishes my overall enthusiasm. These concerns are addressable and doing so would greatly strengthen the manuscript. I highlight individual major concerns below, and save minor concerns and specific suggestions for private recommendations for the authors.

      Major concerns

      1 The authors have provided strong genetic evidence for a parallel mechanism to canonical HIF-1 activity in response to hypoxia. The authors should more rigorously test whether there is evidence for cross-talk between the two mechanisms. In the discussion, the authors highlight findings in mammals that support this possibility. For example, does loss of one lead to hyperactivation of the other in an attempt to compensate for hypoxia?

      We thank the reviewer for suggesting these interesting experiments to examine cross-talk!

      Specific examples:

      • In regards to lines 425-426, does loss of hpk-1 stabilize HIF-1 (or does hpk-1(oe) repress hif-1)?

      We attempted to study HPK-1–HIF-1 cross talk via GFP imaging of the UL1447 HIF-1::GFP strain after hpk-1 RNAi (Figure R4, below). However, although we did observe an increase in GFP levels in hypoxia (vs. normoxia), we did not observe nuclear localization, possibly due to the rapid degradation of HIF-1 in normoxia, which occurs inevitably during our experimental procedure. We therefore opted not to include these data in the manuscript.

      Figure R4: Regulation of HIF-1::GFP. Quantification of GFP levels in UL1447 (unc-119(ed3) III; leEx1447 [hif-1::GFP + unc-119(+)]) adult animals expressing HIF-1::GFP. Animals were fed EV RNAi or nhr-49, hif-1, hpk-1, or nhr-67 RNAi as indicated and exposed to 4 hr of 0.5% O2 without recovery (three repeats totalling >30 individual animals per strain). *, ***, **** p < 0.05, 0.001, 0.0001 (two-way ANOVA corrected for multiple comparisons using the Tukey method).

      • Does loss of hif-1 or nhr-49 alter the expression, stability, or activity of the other (either under normoxic or hypoxic conditions)?

      We appreciate the reviewer’s interest in examining the interaction between nhr-49 and hif-1. To address this, we generated an NHR-49::GFP;hif-1(-) strain and analysed it by imaging after exposure to normoxia or hypoxia. Although loss of hif-1 does result in a slight whole-body up-regulation of NHR-49::GFP, this increase was not significant (new Figure 2—figure supplement 1C, D). Higher magnification images did not show a tissue-specific effect in NHR-49::GFP increase in the hif-1(-) background either (new Figure 2D, Figure 2—figure supplement 1E, F). For reasons mentioned above, the HIF-1::GFP;nhr-49(RNAi) experiment was inconclusive.

      • Can overexpression of either hif-1 or nhr-49 rescue the developmental defects caused by loss of the other (i.e. overexpress hif-1 in nhr-49 mutant animals, and vice versa).

      With the new NHR-49::GFP;hif-1(-) strain, we were able to study compensatory effects of overexpressing NHR-49 in hif-1 mutants by performing embryo hypoxia survival experiments (new Figure 2E). Excitingly, while NHR-49 overexpression does not provide enhanced hypoxia survival at baseline (vs. non-GFP siblings), NHR-49 overexpression rescued the deficiency of hif-1 mutants. This suggests that nhr-49 can partially compensate for loss of the hif-1 pathway. Testing whether HIF-1::GFP overexpression rescues nhr-49 loss requires non-GFP sibling controls. Although the UL1447 strain expresses HIF-1::GFP from an extrachromosomal array, in our hands, we never observed non-GFP worms (i.e. 100% HIF-1::GFP offspring), and therefore were unable to test whether HIF-1 overexpression compensates for nhr-49 loss.

      • Does NHR-67 negatively regulate hif-1 (specificity to NHR-49)?

      As noted above, we were unfortunately unable to conclusively assess HIF-1::GFP levels, likely due to rapid degradation during the normoxia that occurs during animal harvest.

      2 The role of autophagy in hypoxia should be explored in greater detail. While the evidence presented by the authors clearly demonstrates autophagy is essential for hypoxic survival, autophagy is an important component of many biological processes. Thus, it's critical to distinguish whether autophagy is merely required (perhaps for very indirect reasons) or whether autophagy is a part of an adaptive response to hypoxia. The authors (Miller lab) previously failed to find a role for autophagy in hypoxia (Fawcett et al. 2015 Aging Cell), which should be addressed. Has autophagy been previously linked to hypoxia in C. elegans? The novelty of this discovery should be discussed in greater detail.

      We appreciate that the link of autophagy to hypoxia survival needed to be examined further. We now provide substantial new evidence showing that not only are autophagy genes and autophagosome formation induced in hypoxia, but also that mutations in autophagy genes result in hypoxia sensitivity. In our opinion, this strongly supports a key role for autophagy in hypoxia adaptation.

      We note that the study by Fawcett et al., 2015 studied only two genes in hypoxia, bec-1 and unc-51, neither of which was found to be regulated by hypoxia in our RNA-seq analysis. Another study from the Miller lab found that an 18-hour anoxia exposure of L2/L3 stage C. elegans results in a significant induction of autophagy in the intestine (Chapin et al., 2015; Fig 4B, C). Although the conditions in this study are different from ours (anoxia vs. hypoxia, exposure time, animal developmental stage), this study, like ours, finds that low oxygen availability induces autophagy. Besides the Miller lab, several other publications show an important role for autophagy in hypoxia adaptation across species (Samokhvalov et al., 2008; Zhang et al., 2008). Especially relevant to our manuscript is a recent paper published while we were revising our manuscript, which shows that autophagy gene induction is HIF-1 independent in Drosophila melanogaster (Valko et al., 2021). This agrees well with our exciting new discoveries. We have revised the text to better discuss this context.

      3 The authors have possibly over-interpreted their results in Figure 4B and the possibility that NHR-49 acts cell non-autonomously. The authors speculate that tissue specific genetic rescue by NHR-49 over-expression could indicate the existence of a signaling molecule (line 499). Ectopic over-expression of a transcription factor within one tissue is always tricky to interpret, as it may not be physiologically relevant, which I fear may be the case as rescue is achieved when NHR-49 is over-expressed within any tissue (i.e. there is no specificity). An alternative explanation, which is a more indirect model, is that NHR-49 over-expression shifts metabolism within a tissue to generate metabolites that are released throughout the organism to sustain it during hypoxia.

      We thank the reviewer for this excellent point, and agree that indirect action of NHR-49 remains a possibility. We have added a discussion of this point to the revised manuscript.

      4 As an extension of MC#3, the authors demonstrate that NHR-49 is induced throughout the animal after hypoxia (Figure 5A). Presumably sites of NHR-49 induction (tissues) equates to the sites where nhr-49 is necessary. However, the images within 5A cannot be resolved to identify individual tissues, higher resolution images are necessary and quantification of GFP expression within individual tissues could lend biological insight.

      We now provide higher resolution images of NHR-49::GFP in Figure 2D, Figure 2—figure supplement 1E, F.

      5 The gene expression analysis is lacking details. For example, the RNA-seq data shown in Figure 3A&B is confusing. The numbers in the text do not match the figure and it is unclear whether the intersection in the Venn Diagram represent inverse relationships (i.e. the proportion of genes that are upregulated in wild-type that are either hif-1 or nhr-49 dependent). Greater detail and explanation is needed, as presented little biological insight can be discerned from the Figure 3A&B. Next, qRT-PCR validation of autophagy gene expression found in Figure 3C should be provided with that result. Lastly, are there existing datasets for changes in gene expression of C. elegans exposed to hypoxia? If so, how do the datasets compare?

      We apologize for the confusion and have revised our text describing the RNA-seq analysis as well as the Figure legend. We also provide validation of the RNA-seq data with GFP-reporters and have compared our dataset to a previous study on hypoxia dependent gene regulation in C. elegans.
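      The Venn-diagram intersections the reviewer asks about can be read unambiguously if each category is defined as a set operation on hypoxia-induced genes. The sketch below assumes that a gene counts as "hif-1-dependent" if hypoxia induces it in wild type but not in the hif-1 mutant, and likewise for nhr-49. This definition, the function name and the gene IDs are hypothetical placeholders, not the manuscript's data or thresholds.

```python
def classify_dependence(induced_wt: set, induced_hif1: set, induced_nhr49: set) -> dict:
    """Partition wild-type hypoxia-induced genes by which mutant loses their induction.

    Hypothetical definition: a gene is X-dependent if it is induced in wild type
    but no longer induced in the X mutant.
    """
    hif1_dep = induced_wt - induced_hif1
    nhr49_dep = induced_wt - induced_nhr49
    return {
        "hif-1 only": hif1_dep - nhr49_dep,
        "nhr-49 only": nhr49_dep - hif1_dep,
        "both": hif1_dep & nhr49_dep,
        "neither": induced_wt - hif1_dep - nhr49_dep,
    }


# Placeholder gene IDs for illustration only (not real data).
wt = {"g1", "g2", "g3", "g4", "g5"}
hif1_mut = {"g2", "g3", "g5"}   # still induced in hif-1(-)
nhr49_mut = {"g1", "g3", "g5"}  # still induced in nhr-49(-)

for label, genes in classify_dependence(wt, hif1_mut, nhr49_mut).items():
    print(label, sorted(genes))
```

      Each wild-type-induced gene then falls into exactly one category, so the four counts always add up to the number of genes induced in wild type.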

      6 The authors identify a putative negative feedback loop between NHR-67 and NHR-49, and suggest this regulation is at the protein level (Figure 5F,G) based on a translational reporter and not transcriptional regulation based on qRT-PCR results and similar results previously found with hpk-1 (Figures S5A, 7a, and a previous study). However, the authors should more rigorously rule out dynamic changes in expression between tissues that cannot be ascertained by qRT-PCR (i.e. test whether nhr-49p::GFP expression is altered after nhr-67(RNAi) +/- hypoxia.

      We agree and have more rigorously studied this interaction.

    1. Author response:

      Reviewer #1 (Public Review):

      Reviewer #1, comment #1: The study is thorough and systematic, and in comparing three well-separated hypotheses about the mechanism leading from grid cells to hexasymmetry it takes a neutral stand above the fray which is to be particularly appreciated. Further, alternative models are considered for the most important additional factor, the type of trajectory taken by the agent whose neural activity is being recorded. Different sets of values, including both "ideal" and "realistic" ones, are considered for the parameters most relevant to each hypothesis. Each of the three hypotheses is found to be viable under some conditions, and less so in others. Having thus given a fair chance to each hypothesis, nevertheless, the study reaches the clear conclusion that the first one, based on conjunctive grid-by-head-direction cells, is much more plausible overall; the hypothesis based on firing rate adaptation has intermediate but rather weak plausibility; and the one based on clustering of cells with similar spatial phases in practice would not really work. I find this conclusion convincing, and the procedure to reach it, a fair comparison, to be the major strength of the study.

      Response: Thanks for your positive assessment of our manuscript.

      Reviewer #1, comment #2: What I find less convincing is the implicit a priori discarding of a fourth hypothesis, that is, that the hexasymmetry is unrelated to the presence of grid cells. Full disclosure: we have tried unsuccessfully to detect hexasymmetry in the EEG signal from vowel space and did not find any (Kaya, Soltanipour and Treves, 2020), so I may be ranting off my disappointment, here. I feel, however, that this fourth hypothesis should be at least aired, for a number of reasons. One is that a hexasymmetry signal has been reported also from several other cortical areas, beyond entorhinal cortex (Constantinescu et al, 2016); true, also grid cells in rodents have been reported in other cortical areas as well (Long and Zhang, 2021; Long et al, bioRxiv, 2021), but the exact phenomenology remains to be confirmed.

      Response: Thank you for the suggestion to add the hypothesis that the neural hexasymmetry observed in previous fMRI and intracranial EEG studies may be unrelated to grid cells. Following your suggestion, we have now mentioned at the end of the fourth paragraph of the Introduction that “the conjunctive grid by head-direction cell hypothesis does not necessarily depend on an alignment between the preferred head directions with the grid axes”. Furthermore, at the end of section “Potential mechanisms underlying hexadirectional population signals in the entorhinal cortex” (in the Discussion) we write: “However, none of the three hypotheses described here may be true and another mechanism may explain macroscopic grid-like representations. This includes the possibility that neural hexasymmetry is completely unrelated to grid-cell activity, previously summarized as the ‘independence hypothesis' (Kunz et al., 2019). For example, a population of head-direction cells whose preferred head directions occur at offsets of 60 degrees from each other could result in neural hexasymmetry in the absence of grid cells. The conjunctive grid by head-direction cell hypothesis thus also works without grid cells, which may explain why grid-like representations have been observed (using fMRI) in regions outside the entorhinal cortex, where rodent studies have not yet identified grid cells (Doeller et al., 2010; Constantinescu et al., 2016). In that case, however, another mechanism would be needed that could explain why the preferred head directions of different head-direction cells occur at multiples of 60 degrees. Attractor-network structures may be involved in such a mechanism, but this remains speculative at the current stage.” We now also mention the results from Long and Zhang (second paragraph of the Introduction): “Surprisingly, grid cells have also been observed in the primary somatosensory cortex in foraging rats (Long and Zhang, 2021).”

      Regarding your EEG study, we have added a reference to it in the manuscript and state that it is an example for a study that did not find evidence for neural hexasymmetry (end of first paragraph of the Discussion): “We note though that some studies did not find evidence for neural hexasymmetry. For example, a surface EEG study with participants “navigating” through an abstract vowel space did not observe hexasymmetry in the EEG signal as a function of the participants’ movement direction through vowel space (Kaya et al., 2020). Another fMRI study did not find evidence for grid-like representations in the ventromedial prefrontal cortex while participants performed value-based decision making (Lee et al., 2021). This raises the question whether the detection of macroscopic grid-like representations is limited to some recording techniques (e.g., fMRI and iEEG but not surface EEG) and to what extent they are present in different tasks.”

      Reviewer #1, comment #3: Second, as the authors note, the conjunctive mechanism is based on the tight coupling of a narrow head direction selectivity to one of the grid axes. They compare "ideal" with "Doeller" parameters, but to me the "Doeller" ones appear rather narrower than commonly observed and, crucially, they are applied to all cells in the simulations, whereas in reality only a proportion of cells in mEC are reported to be grid cells, only a proportion of them to be conjunctive, and only some of these to be narrowly conjunctive. Further, Gerlei et al (2020) find that conjunctive grid cells may have each of their fields modulated by different head directions, a truly surprising phenomenon that, if extensive, seems to me to cast doubts on the relation between mass activity hexasymmetry and single grid cells.

      Response: We have revised the manuscript in several ways to address the different aspects of this comment.

      Firstly, we agree with the reviewer that our “Doeller” parameter for the tuning width is narrower than commonly observed. We have therefore revised the concentration parameter κ_c in the ‘realistic’ case from 10 rad⁻² (corresponding to a tuning width of 18°) to 4 rad⁻² (corresponding to a tuning width of 29°). We chose this value by referring to Supplementary Figure 3 of Doeller et al. (2010). In their figure, the tuning curves usually cover between one sixth and one third of a circle. Since stronger head-direction tuning contributes the most to the resulting hexasymmetry, we chose a value of κ_c = 4 for the tuning parameter, which corresponds to a tuning width (= half width) of 29° (full width of roughly one sixth of a circle). Regarding the coupling of the preferred head directions to the grid axes, the specific value of the jitter σ_c = 3 degrees that quantifies the coupling of the head-direction preference to the grid axes was extracted from the 95% confidence interval given in the third row of the Table in Supplementary Figure 5b of Doeller et al. 2010. We now better explain the origin of these values in our new Methods section “Parameter estimation” and provide an overview of all parameter values in Table 1.
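      The κ_c-to-width pairs quoted above are consistent with the common small-angle approximation for a von Mises tuning curve, in which the half width in radians is roughly 1/√κ. This is also why κ_c carries units of rad⁻². The sketch below reflects our assumption about how the conversion can be made and is not taken from the authors' code. It reproduces 18° for κ_c = 10 and 29° for κ_c = 4, and it inverts the rule to find the κ_c corresponding to a 60° width.

```python
import math


def tuning_width_deg(kappa: float) -> float:
    """Approximate half width (degrees) of a von Mises tuning curve.

    Assumes the small-angle rule: width ~ 1 / sqrt(kappa) radians.
    """
    return math.degrees(1.0 / math.sqrt(kappa))


def kappa_for_width_deg(width_deg: float) -> float:
    """Inverse of tuning_width_deg: kappa ~ 1 / width**2, with width in radians."""
    return 1.0 / math.radians(width_deg) ** 2


for kappa in (10.0, 4.0):
    print(f"kappa = {kappa:4.1f} rad^-2 -> half width ~ {tuning_width_deg(kappa):.0f} deg")
print(f"half width 60 deg -> kappa ~ {kappa_for_width_deg(60.0):.2f} rad^-2")
```

      Under this rule, a 60° half width corresponds to κ_c ≈ 0.91 rad⁻². At such low concentrations the small-angle approximation becomes coarse, so this value should be read as indicative only.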

      Furthermore, in response to your comment, we have revised Figure 2E to show neural hexasymmetries for a larger range of values of the jitter (σ_c from 0 to 30 degrees), well beyond the values that Doeller et al. suggested. We have also added a new supplementary figure (Figure 2 – figure supplement 1) where we further extend the range of tuning widths (parameter κ_c) to 60 degrees. This provides the reader with a comprehensive understanding of what parameter values are needed to reach a particular hexasymmetry.

      Regarding your comments on the prevalence of conjunctive grid by head-direction cells, we have revised the manuscript to make it explicit that the actual percentage of conjunctive cells with the necessary properties may be low in the entorhinal cortex (first paragraph of section “A note on our choice of the values of model parameters” of the Discussion): “Empirical studies in rodents found a wide range of tuning widths among grid cells ranging from broad to narrow (Doeller et al., 2010; Sargolini et al., 2006). The percentage of conjunctive cells in the entorhinal cortex with a sufficiently narrow tuning may thus be low. Such distributions (with a proportionally small number of narrowly tuned conjunctive cells) lead to low values in the absolute hexasymmetry. The neural hexasymmetry in this case would be driven by the subset of cells with sufficiently narrow tuning widths. If this causes the neural hexasymmetry to drop below noise levels, the statistical evaluation of this hypothesis would change.” In addition, in Figure 5, we have applied the coupling between preferred head directions and grid axes to only one third of all grid cells (parameter p_c = ⅓ in Table 1), following the values reported by Boccara et al. 2010 and Sargolini et al. 2006. To strengthen the link between Figure 5 and Figure 2, we now state the hexasymmetry when using p_c = ⅓ along with a ‘realistic’ tuning width and jitter for head-direction modulated grid cells in Figure 2H. Additionally, we performed new simulations where we observed a linear relationship (above the noise floor) between the proportion of conjunctive cells and the hexasymmetry. This should help the reader understand the effect of a reduced percentage of conjunctive cells on the absolute hexasymmetry values. We have added these results as a new supplementary figure (Figure 2 – figure supplement 2).
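      A simplified, trajectory-free model shows why a linear dependence on the proportion of conjunctive cells is expected. This model is our own assumption and not the simulation used in the manuscript. Suppose each of N cells has von Mises head-direction tuning f(θ) = e^{κ_c(cos θ − 1)}. A fraction p_c of the cells has preferred directions at multiples of 60° plus Gaussian jitter with standard deviation σ_c (in radians), and the remaining cells have uniformly distributed preferences. Take as a hexasymmetry index H the sixth circular Fourier component of the summed activity A(θ), normalized by its mean. Then, using the von Mises Fourier coefficients e^{−κ_c} I_m(κ_c) and the Gaussian characteristic function E[e^{−6iε}] = e^{−18σ_c²}:

$$
\bar{A}_m = \frac{1}{2\pi}\int_0^{2\pi} A(\theta)\,e^{-im\theta}\,d\theta,
\qquad
\mathbb{E}\big[\bar{A}_6\big] = N\,p_c\,e^{-\kappa_c} I_6(\kappa_c)\,e^{-18\sigma_c^2},
\qquad
\bar{A}_0 = N\,e^{-\kappa_c} I_0(\kappa_c),
$$

$$
H \approx \frac{\big|\mathbb{E}[\bar{A}_6]\big|}{\bar{A}_0} = p_c\,\frac{I_6(\kappa_c)}{I_0(\kappa_c)}\,e^{-18\sigma_c^2}.
$$

      In this setting, the uniformly tuned cells contribute nothing to the expected sixfold component. They only add a noise floor that scales roughly as 1/√N. The index is therefore proportional to p_c once it rises above that floor.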

      Finally, regarding your comment on the findings by Gerlei et al. 2020, we now reference this study in our manuscript and discuss the possible implications (second paragraph of section “A note on our choice of the values of model parameters” of the Discussion): “Additionally, while we assumed that all conjunctive grid cells maintain the same preferred head direction between different firing fields, conjunctive grid cells have also been shown to exhibit different preferred head directions in different firing fields (Gerlei et al., 2020). This could lead to hexadirectional modulation if the different preferred head directions are offset by 60° from each other, but will not give rise to hexadirectional modulation if the preferred head directions are randomly distributed. To the best of our knowledge, the distribution of preferred head directions was not quantified by Gerlei et al. (2020), thus this remains an open question.”

      Reviewer #1, comment #4: Finally, a variant of the fourth hypothesis is that the hexasymmetry might be produced by a clustering of head direction preferences across head direction cells similar to that hypothesized in the first hypothesis, but without such cells having to fire in grid patterns. If head direction selectivity is so clustered, who needs the grids? This would explain why hexasymmetry is ubiquitous, and could easily be explored computationally by, in fact, a simplification of the models considered in this study.

      Response: We fully agree with you. We now explain this possibility in the Introduction where we introduce the conjunctive grid by head-direction cell hypothesis (fourth paragraph of the Introduction) and return to it in the Discussion (section “Potential mechanisms underlying hexadirectional population signals in the entorhinal cortex”). There, we now also explain that in such a case another mechanism would be needed to ensure that the preferred head directions of head-direction cells exhibit six-fold rotational symmetry.
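A toy computation can check the reviewer's variant. A population of pure head-direction cells, without any grid firing, produces a hexadirectional population signal only if the cells' preferred directions cluster with six-fold symmetry. The following is a minimal sketch under assumed values, none of which are taken from the manuscript: von Mises tuning with κ = 15, 300 cells, a 5° jitter around the six clusters, and the relative sixth-harmonic amplitude of the mean rate as the hexasymmetry measure.

```python
import cmath
import math
import random


def relative_hexasymmetry(prefs, kappa=15.0, n_dirs=180):
    """Relative sixth-harmonic amplitude of the mean rate of von Mises-tuned cells."""
    total, sixfold = 0.0, 0j
    for m in range(n_dirs):
        theta = 2 * math.pi * m / n_dirs
        rate = sum(math.exp(kappa * (math.cos(theta - p) - 1.0)) for p in prefs) / len(prefs)
        total += rate
        sixfold += rate * cmath.exp(6j * theta)
    return abs(sixfold) / total


rng = random.Random(0)
n_cells = 300
# Head-direction cells whose preferences cluster around six directions 60 deg apart.
clustered = [math.radians(60 * rng.randrange(6) + rng.gauss(0.0, 5.0)) for _ in range(n_cells)]
# Head-direction cells whose preferences are spread uniformly around the circle.
uniform = [rng.uniform(0.0, 2 * math.pi) for _ in range(n_cells)]

if __name__ == "__main__":
    print("six-fold clustered:", round(relative_hexasymmetry(clustered), 3))
    print("uniform:", round(relative_hexasymmetry(uniform), 3))
```

The same computation applies to the Gerlei et al. (2020) scenario. Preferred directions offset by multiples of 60° behave like the clustered case, and randomly distributed ones behave like the uniform case.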

      Reviewer #2 (Public Review):

      Reviewer #2, comment #1: Grid cells - originally discovered in single-cell recordings from the rodent entorhinal cortex, and subsequently identified in single-cell recordings from the human brain - are believed to contribute to a range of cognitive functions including spatial navigation, long-term memory function, and inferential reasoning. Following a landmark study by Doeller et al. (Nature, 2010), a plethora of human neuroimaging studies have hypothesised that grid cell population activity might also be reflected in the six-fold (or 'hexadirectional') modulation of the BOLD signal (following the six-fold rotational symmetry exhibited by individual grid cell firing patterns), or in the amplitude of oscillatory activity recorded using MEG or intracranial EEG. The mechanism by which these network-level dynamics might arise from the firing patterns of individual grid cells remains unclear, however.

      In this study, Khalid and colleagues use a combination of computational modelling and mathematical analysis to evaluate three competing hypotheses that describe how the hexadirectional modulation of population firing rates (taken as a simple proxy for the BOLD, MEG, or iEEG signal) might arise from the firing patterns of individual grid cells. They demonstrate that all three mechanisms could account for these network-level dynamics if a specific set of conditions relating to the agent's movement trajectory and the underlying properties of grid cell firing patterns are satisfied.

      The computational modelling and mathematic analyses presented here are rigorous, clearly motivated, and intuitively described. In addition, these results are important both for the interpretation of hexadirectional modulation in existing data sets and for the design of future experiments and analyses that aim to probe grid cell population activity. As such, this study is likely to have a significant impact on the field by providing a firmer theoretical basis for the interpretation of neuroimaging data. To my mind, the only weakness is the relatively limited focus on the known properties of grid cells in rodent entorhinal cortex, and the network level activity that these firing patterns might be expected to produce under each hypothesis. Strengthening the link with existing neurobiology would further enhance the importance of these results for those hoping to assay grid cell firing patterns in recordings of ensemble-level neural activity.

      Response: Thank you very much for reviewing our manuscript and your positive assessment. Following your comments, we have revised the manuscript to more closely link our simulations to known properties of grid cells in the rodent entorhinal cortex.

      Reviewer #3 (Public Review):

      Reviewer #3, comment #1: This is an interesting and carefully carried out theoretical analysis of potential explanations for hexadirectional modulation of neural population activity that has been reported in the human entorhinal cortex and some other cortical regions. The previously reported hexadirectional modulation is of considerable interest as it has been proposed to be a proxy for the activation of grid cell networks. However, the extent to which this proposal is consistent with the known firing properties of grids hasn't received the attention it perhaps deserves. By comparing the predictions of three different models this study imposes constraints on possible mechanisms and generates predictions that can be tested through future experimentation.

      Overall, while the conclusions of the study are convincing, I think the usefulness to the field would be increased if null hypotheses were more carefully considered and if the authors' new metric for hexadirectional modulation (H) could be directly contrasted with previously used metrics. For example, if the effect sizes for hexadirectional modulation in the previous fMRI and EEG data could be more directly compared with those of the models here, then this could help in establishing the extent to which the experimental hexadirectional modulation stands out from path hexasymmetry and how close it comes to the striking modulation observed with the conjunctive models. It could also be helpful to consider scenarios in which hexadirectional modulation is independent of grid firing, for example perhaps with appropriate coordination of head direction cell firing.

      Response: Thanks for reviewing our manuscript and for the overall positive assessment. The new Methods section “Implementation of previously used metrics” starts with the following sentences: “We applied three previously used metrics to our framework: the Generalized Linear Model (GLM) method by Doeller et al. 2010; the GLM method with binning by Kunz et al. 2015; and the circular-linear correlation method by Maidenbaum et al. 2018.” We have created a new supplementary figure (Figure 5 – figure supplement 4) in which we compare the results from these other methods to the results of our new method. Overall, the results are highly similar, indicating that all these methods are equally suited to test for a hexadirectional modulation of neural activity.

      In section “Implementation of previously used metrics” we then explain: “In brief, in the GLM method (e.g. used in Doeller et al., 2010), the hexasymmetry is found in two steps. First, the orientation of the hexadirectional modulation is estimated on the first half of the data by using the regressors cos(6θ_t) and sin(6θ_t) on the time-discrete fMRI activity (Equation 9), with θ_t being the movement direction of the subject in time step t. With β1 and β2 denoting the beta values of these two regressors, the orientation φ satisfies 6φ = atan2(β2, β1). Second, the amplitude of the signal is estimated on the second half of the data using the single regressor cos(6(θ_t − φ)). The hexasymmetry is then evaluated from this amplitude.

      The GLM method with binning (e.g. used in Kunz et al., 2015) uses the same procedure as the GLM method for estimating the grid orientation in the first half of the data. The amplitude, however, is estimated differently on the second half, by a regressor that has a value of 1 if θ_t is aligned with a peak of the hexadirectional modulation and a value of -1 if θ_t is misaligned. Alignment is determined from θ_t − φ using the modulo operator with a period of 60°. The hexasymmetry is then calculated from the amplitude in the same way as in the GLM method.

      The circular-linear correlation method (e.g. used in Maidenbaum et al., 2018) is similar to the GLM method in that it uses the regressors β1 cos(6θ_t) and β2 sin(6θ_t) on the time-discrete mean activity. However, instead of using β1 and β2 to estimate the orientation of the hexadirectional modulation, it uses the beta values directly to estimate the hexasymmetry via the relation hexasymmetry = √(β1² + β2²).”
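The three previously used metrics can be sketched on synthetic data as follows. This illustration is not the code behind Figure 5 – figure supplement 4, and it rests on several assumptions:

- movement directions are drawn independently and uniformly;
- the activity is a noisy cosine of 6θ_t;
- every fit includes an intercept;
- the "aligned" window of the binning method is ±15° around each peak;
- each function returns the fitted amplitude rather than converting it to the manuscript's hexasymmetry scale.

All function names are hypothetical.

```python
import math
import random


def ols(rows, y):
    """Ordinary least squares via the normal equations (small, well-conditioned designs)."""
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    for col in range(k):  # Gaussian elimination with partial pivoting
        piv = max(range(col, k), key=lambda row: abs(a[row][col]))
        a[col], a[piv], b[col], b[piv] = a[piv], a[col], b[piv], b[col]
        for r in range(col + 1, k):
            f = a[r][col] / a[col][col]
            for c in range(col, k):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * k
    for i in reversed(range(k)):
        beta[i] = (b[i] - sum(a[i][j] * beta[j] for j in range(i + 1, k))) / a[i][i]
    return beta


def estimate_orientation(theta, y):
    """First half: regress on cos(6θ) and sin(6θ); 6φ is the phase of (β1, β2)."""
    _, b1, b2 = ols([(1.0, math.cos(6 * t), math.sin(6 * t)) for t in theta], y)
    return math.atan2(b2, b1) / 6


def glm_amplitude(theta, y, phi):
    """Second half: amplitude of the single regressor cos(6(θ - φ))."""
    return ols([(1.0, math.cos(6 * (t - phi))) for t in theta], y)[1]


def binned_amplitude(theta, y, phi, half_width=math.radians(15)):
    """Second half: +1 if θ lies within ±half_width of a peak (assumed window), else -1."""
    period = math.pi / 3  # 60 degrees
    x = [1.0 if (t - phi + half_width) % period < 2 * half_width else -1.0 for t in theta]
    return ols([(1.0, xi) for xi in x], y)[1]


def circular_linear_amplitude(theta, y):
    """All data: sqrt(β1² + β2²) from the cos/sin regressors."""
    _, b1, b2 = ols([(1.0, math.cos(6 * t), math.sin(6 * t)) for t in theta], y)
    return math.hypot(b1, b2)


def synthetic_session(n=4000, amp=0.5, phi=math.radians(10), noise=0.2, seed=0):
    """Hypothetical data: uniform directions, six-fold cosine modulation plus noise."""
    rng = random.Random(seed)
    theta = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
    y = [1.0 + amp * math.cos(6 * (t - phi)) + rng.gauss(0, noise) for t in theta]
    return theta, y


if __name__ == "__main__":
    theta, y = synthetic_session()
    h = len(theta) // 2
    phi = estimate_orientation(theta[:h], y[:h])
    print("orientation (deg):", round(math.degrees(phi), 2))
    print("GLM amplitude:", round(glm_amplitude(theta[h:], y[h:], phi), 3))
    print("binned amplitude:", round(binned_amplitude(theta[h:], y[h:], phi), 3))
    print("circular-linear amplitude:", round(circular_linear_amplitude(theta, y), 3))
```

On these data, the GLM and circular-linear estimates recover the simulated amplitude of 0.5. The binned regressor returns roughly 2/π of it under the ±15° assumption, because it averages the cosine within each bin. The raw values of the metrics are therefore not interchangeable, even when their statistical conclusions agree.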

      For each of the three previously used metrics and our new method, we estimated the resulting hexasymmetry (new Figure 5 – figure supplement 4 in the manuscript). In the Methods section “Implementation of previously used metrics” we then continue with our explanations: “Regarding the statistical evaluation, each method evaluates the size of the neural hexasymmetry differently. Specifically, the new method developed in our manuscript compares the neural hexasymmetry to path hexasymmetry to test whether neural hexasymmetry is significantly above path hexasymmetry. For the two generalized linear model (GLM) methods, we compare the hexasymmetry to zero (using the Mann-Whitney U test) to establish significance. Hexasymmetry values can be negative in these approaches, which allows the statistical comparison against 0. Negative values occur when the estimated grid orientation from the first data half does not match the grid orientation from the second data half. For the statistical evaluation of the circular-linear correlation method, we calculated a z-score by comparing each empirical observation of the hexasymmetry to hexasymmetries from a set of surrogate distributions (as in Maidenbaum et al., 2018). We then calculated a p-value by comparing the distribution of z-scores against zero using a Mann-Whitney U test. We use the z-scores instead of the hexasymmetry for the circular-linear correlation method to match the procedure used in Maidenbaum et al. (2018). We obtained the surrogate distributions by circularly shifting the vector of movement directions relative to the time-dependent vector of firing rates. For random walks, the vector is shifted by a random integer drawn from a uniform distribution over the number of time points in the vector of movement directions. For the star-like walks and piecewise linear walks, the shift is a random integer multiplied by the number of time points in a linear segment. Circularly shifting the vector of movement directions scrambles the correlations between movement direction and neural activity while preserving their temporal structure.”
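A minimal sketch of this surrogate procedure follows; it is not the manuscript's code. As a stand-in for the circular-linear hexasymmetry, it uses 2|mean((y − ȳ)e^{6iθ})|, which approximates √(β1² + β2²) when directions are roughly uniformly sampled. The piecewise linear walk, its parameters, and the exclusion of a zero shift are assumptions. With `segment_len=1`, the shift can be any number of time points, as for random walks. With `segment_len=L`, the shift is a whole number of segments, as for star-like and piecewise linear walks.

```python
import cmath
import math
import random
from statistics import mean, stdev


def hexasymmetry(theta, y):
    """Stand-in statistic 2|mean((y - ȳ) e^{6iθ})| (≈ sqrt(β1² + β2²) for uniform directions)."""
    y_bar = sum(y) / len(y)
    return 2 * abs(sum((v - y_bar) * cmath.exp(6j * t) for t, v in zip(theta, y)) / len(y))


def circular_shift(seq, k):
    """Rotate seq by k positions; the order of samples within it is preserved."""
    k %= len(seq)
    return seq[-k:] + seq[:-k]


def surrogate_z(theta, y, segment_len=1, n_surr=200, seed=0):
    """z-score of the observed statistic against circularly shifted movement directions."""
    rng = random.Random(seed)
    observed = hexasymmetry(theta, y)
    n_segments = len(theta) // segment_len
    surr = []
    for _ in range(n_surr):
        k = rng.randrange(1, n_segments) * segment_len  # assumption: zero shift excluded
        surr.append(hexasymmetry(circular_shift(theta, k), y))
    return (observed - mean(surr)) / stdev(surr)


def piecewise_linear_walk(n_segments=100, segment_len=20, amp=0.3, noise=0.3, seed=1):
    """Hypothetical trajectory: constant direction within each linear segment."""
    rng = random.Random(seed)
    theta, y = [], []
    for _ in range(n_segments):
        direction = rng.uniform(0, 2 * math.pi)
        for _ in range(segment_len):
            theta.append(direction)
            y.append(1.0 + amp * math.cos(6 * direction) + rng.gauss(0, noise))
    return theta, y


if __name__ == "__main__":
    theta, y = piecewise_linear_walk()
    print("z-score:", round(surrogate_z(theta, y, segment_len=20), 2))
```

Shifting by whole segments keeps each surrogate direction sequence piecewise constant, so the surrogates share the temporal autocorrelation of the real trajectory. Only the alignment between direction and activity is broken.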

      The results of these simulations, i.e. the comparison of our new method to previously used metrics, are summarized in Figure 5 – figure supplement 4 and show qualitatively identical findings when using the different methods. We have added this information also to the manuscript in the third paragraph of section “Quantification of hexasymmetry of neural activity and trajectories” of the Methods: “Empirical (fMRI/iEEG) studies (e.g. Doeller et al., 2010; Kunz et al., 2015; Maidenbaum et al., 2018) addressed this problem of trajectories spuriously contributing to hexasymmetry by fitting a Generalized Linear Model (GLM) to the time discrete fMRI/iEEG activity. In contrast, our new approach to hexasymmetry in Equation (12) quantifies the contribution of the path to the neural hexasymmetry explicitly, and has the advantage that it allows an analytical treatment (see next section). Comparing our new method with previous methods for evaluating hexasymmetry led to qualitatively identical statistical effects (Figure 5 – figure supplement 4).” We have also added a pointer to this new supplementary figure in the caption of Figure 5 in the manuscript: “For a comparison between our method and previously used methods for evaluating hexasymmetry, see Figure 5 – figure supplement 4.”

    1. Author Response:

      Reviewer #1 (Public Review):

      Two important goals in evolutionary biology are (i) to understand why different species exhibit different levels of genetic diversity and (ii) in each species, what is the evolutionary nature of genetic variants. Are genetic variants mostly neutral, deleterious, or advantageous? In their study, Stolyarova et al. looked at one of the most polymorphic species known, the fungus Schizophyllum commune. They found that in this hyperpolymorphic species, the evolutionary forces that govern and structure genetic variation can be very different compared to less polymorphic species, including humans and flies. Specifically, the authors find that a process known as positive epistasis is quantitatively abundant among genetic variants that alter proteins in S. commune. Positive epistasis happens when a combination of multiple genetic variants is advantageous for the individuals that carry them, even though each isolated variant in the combination is not advantageous or even detrimental on its own. The authors explain that this happens frequently in their hyperpolymorphic species because the very high polymorphism level makes it very likely that the genetic variants will by chance occur together in the same individuals. In less polymorphic species, the variants that are advantageous in combination may have to wait for each other to occur for too long, for the combination to ever happen often enough in the first place.

      Overall I had a great time reading the manuscript, and I feel that my understanding of evolution has been advanced on a fundamental level after reading it. However part of the reason why I enjoyed it was having to fill the gaps, answer the riddles left unanswered in the story by the authors.

      Strengths:

      1) The model, both extremely polymorphic and amenable to haploid cultures, is ideal to address the questions asked.

      2) The study potentially represents a very important conceptual advance on the way to better understand genetic variation in general.

      3) The interpretations made by the authors of their data are likely the correct ones to make, even though more definitive answers will likely only come from the sequencing of a much larger number of haplotypes, which cannot reasonably be asked of the authors at this point.

      Weaknesses:

      1) The manuscript does not provide enough information to judge if the synonymous controls that are compared to the nonsynonymous variants are fully adequate. Specifically, I have one concern that the Site Frequency Spectrum (SFS) of the synonymous variants at MAF>0.05 may be very different compared to the SFS of nonsynonymous variants at MAF>0.05. I focus on this because the authors mention page 5 line 3: "The excess of LDnonsyn over LDsyn corresponds to the attraction between rare alleles at nonsynonymous sites". First, it is unclear from this or from the figures at this point in the manuscript what the authors mean by rare alleles, among those alleles at MAF>0.05. This needs to be detailed quantitatively much more carefully. Second, and most importantly, this raises the question of whether or not the synonymous controls have a SFS with many less rare (but with MAF>0.05) alleles, as one may expect if they are under less purifying selection than nonsynonymous variants. This then raises the question of whether or not the synonymous control conducted by the author is adequate, or if the authors need to explicitly match the synonymous control in terms of SFS for MAF>0.05 in addition to the distance matching already done.

      We thank the reviewer for this important comment. On page 5, line 3, we meant “the attraction between minor alleles”. To avoid confusion between SNPs with low MAF (“rare”) and minor variants at these polymorphic sites (“minor”), we replaced “rare alleles” with “minor alleles” where appropriate.

      The attraction between minor alleles in nonsynonymous polymorphic sites in S. commune holds if we pool all SNPs together, as is shown in Figure 2 - supplementary figure 4. Following the reviewer’s suggestion, we performed an additional analysis of LD between frequency-matched synonymous and nonsynonymous pairs of SNPs. Specifically, for each possible minor allele count and nucleotide distance, we calculated the number of corresponding pairs of nonsynonymous SNPs and subsampled the same number of synonymous SNPs with the same minor allele count and nucleotide distance. Such subsampling with exact matching of both MAFs and distance shows that LDnonsyn is elevated as compared to LDsyn in both S. commune populations (Figure 2 - figure supplement 3 of the revised version of the manuscript).
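The frequency- and distance-matched subsampling described above can be sketched as follows. This illustration is not the analysis code. The following are assumptions:

- each SNP pair is represented as a hypothetical tuple (minor allele count of SNP A, minor allele count of SNP B, distance, LD);
- the stratum is defined by the unordered pair of minor allele counts plus the distance;
- strata without enough synonymous pairs are dropped rather than sampled with replacement.

```python
import random
from collections import Counter, defaultdict
from statistics import mean


def match_key(pair):
    """Stratum of a SNP pair: its two minor allele counts (unordered) and its distance."""
    mac_a, mac_b, distance, _ld = pair
    return (min(mac_a, mac_b), max(mac_a, mac_b), distance)


def matched_samples(nonsyn_pairs, syn_pairs, rng):
    """Subsample synonymous pairs so that each stratum contains as many synonymous
    as nonsynonymous pairs; strata lacking enough synonymous pairs are dropped."""
    syn_by_key, nonsyn_by_key = defaultdict(list), defaultdict(list)
    for pair in syn_pairs:
        syn_by_key[match_key(pair)].append(pair)
    for pair in nonsyn_pairs:
        nonsyn_by_key[match_key(pair)].append(pair)
    kept_nonsyn, sampled_syn = [], []
    for key, group in nonsyn_by_key.items():
        pool = syn_by_key.get(key, [])
        if len(pool) < len(group):
            continue  # assumption: no sampling with replacement
        kept_nonsyn.extend(group)
        sampled_syn.extend(rng.sample(pool, len(group)))
    return kept_nonsyn, sampled_syn


def ld_excess(nonsyn_pairs, syn_pairs, seed=0):
    """Mean LD of nonsynonymous pairs minus mean LD of matched synonymous pairs."""
    kept, sampled = matched_samples(nonsyn_pairs, syn_pairs, random.Random(seed))
    return mean(p[3] for p in kept) - mean(p[3] for p in sampled)
```

Because both sets then have identical stratum counts, any remaining difference in mean LD cannot be explained by differences in allele frequency or distance between the two classes of SNP pairs.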

      2) The manuscript is far too succinct on several occasions, where observations or interpretations need to be much more detailed and explained.

      We have revised the manuscript for clarity, as detailed below.

      Reviewer #2 (Public Review):

      Stolyarova et al. used a highly polymorphic species, Schizophyllum commune, to explore patterns of LD between nonsynonymous and synonymous mutations within protein-coding genes. LD is informative about interference and interactions between selected loci, with compensatory mutations expected to be in strong positive LD. The benefit of studying this fungal species with large diversity (with π > 0.1) is that populations are able to explore relatively large regions of the fitness landscape, and chances increase that sets of epistatically interacting mutations segregate at the same time.

      This study finds strong positive LD between pairs of nonsynonymous mutations within, but not between, genes, compared to pairs of synonymous variants. Further, the authors show that high LD is prevalent among pairs of mutations at amino acid sites that interact within the protein. This result is consistent with pairs or sets of compensatory nonsynonymous mutations cosegregating within protein-coding genes.

      The conclusions of this paper are largely supported by the data, with some caveats, listed below.

      1) With such large pairwise diversity, there are bound to be many deleterious variants segregating at once, and the large levels of interference between them will make selection much less efficient at purging deleterious variants.

      We agree that simultaneous segregation of multiple deleterious nonsynonymous variants in the linked locus impedes their elimination by negative selection. However, stronger Hill-Robertson interference cannot result in the observed excess of LDnonsyn. Generally, Hill-Robertson interference decreases LDnonsyn, especially under low recombination rate (Hill and Robertson, 1966; Comeron et al., 2008; Garcia and Lohmueller, 2021). We discuss this in Appendix 2 (Supplementary Note 2 in the old version of the manuscript) and reproduce the effect in simulations.

      While the authors argue that balancing selection is needed to account for patterns of haplotype variation they see, widespread balancing selection may not be required in this setting, and soft or partial selective sweeps (either on single mutations or sets of mutations) can also lead to patterns of diversity where a small number of haplotypes are each at appreciable frequency.

      Although partial sweeps can indeed elevate LD in the linked locus, they aren’t expected to cause the excess of LDnonsyn observed in the haploblocks. To show this, we have now simulated partial sweeps with and without epistasis. In the hard sweep model, a new beneficial mutation (s=0.5) was introduced into the population. In the soft sweep model, the beneficial mutation was picked from standing variation: the selection coefficient of an initially neutral variant with frequency > 5% was changed to 0.5. In both cases, simulations were stopped when the beneficial mutation reached a frequency of 0.5. Both hard and soft partial sweeps increase LD compared to simulations without sweeps (Figure R1A,B below). However, even in the presence of pairwise epistasis, they don’t result in LDnonsyn > LDsyn (Figure R1C,D).

      Figure R1. Patterns of LD in simulations with partial selective sweeps. Error bars show the 95% confidence intervals obtained in 100 simulations. Simulation parameters and epistasis models are the same as described in Figure 3 - figure supplement 6.

      Additionally, sweeps are expected to decrease nucleotide diversity in the linked region. However, nucleotide diversity within haploblocks observed in S. commune populations isn’t lower than in the non-haploblocks regions (Figure R2), arguing that the observed patterns can’t be caused by selective sweeps.

      Figure R2. Nucleotide diversity in haploblocks in S. commune populations. Histograms show nucleotide diversity within haploblocks, solid black line shows the average nucleotide diversity in haploblocks. Dashed line shows the average nucleotide diversity in the non-haploblock regions.

      There is also a tension between arguing that balancing selection is widespread and that shared SNPs across populations are expected to arise through recurrent mutation, as balancing selection is known to preserve haplotypes over long evolutionary times. In that section of the discussion especially, I had difficulty following the logic, and some statements are presented more definitively than might be warranted.

      Although we find that balancing selection (either negative frequency-dependent selection or associative overdominance) maintains haploblocks for a long time within S. commune populations, haploblocks aren’t conserved between the two populations, as mentioned in the manuscript. Perhaps this is because the regime of balancing selection has had ample time to change over such long evolutionary timescales (the genetic divergence between the two S. commune populations is > 0.3 dS). This would make the fraction of identical-by-descent polymorphisms in the two populations low. Therefore, the SNPs that are shared between populations most probably arose by recurrent mutation rather than descending from the ancestral population. We have now clarified this in the main text.

      Meanwhile, correlation of LDs between such shared SNPs in the two populations within genes indicates shared epistatic constraints between these populations. Such correlation is seen not because pairs of SNPs are maintained from the ancestral S. commune population, but because epistatic pairs are more likely to be under high LD in both modern populations.

      2) The validations through simulation are somewhat meagre, and I am not convinced that the simulations cover the appropriate parameter regimes. With a population size of 1000, this represents a severe down-scaling of population size and up-scaling of mutation, selection, and recombination rates (if > 0), and it's unclear if such aggressive scaling puts the simulations in an interference/interaction regime far from the true populations.

      Scaling was performed according to the SLiM 3.0 manual to reduce computation time for simulations of highly diverse populations. To address the Reviewer’s concern, we have now also checked that this approach gives the same results as scaling N instead of μ. This holds as long as we scale the selection coefficient s to keep Ns constant and simulate for 100N generations to reach mutation-selection equilibrium. This is indeed the case for 4Nμ up to 0.05 (Figure R3). We did not perform simulations for larger 4Nμ because of the extremely long computation time for large N.

      Figure R3. Simulations of populations with varying nucleotide diversity scaled by population size or mutation rate. (A) nucleotide diversity, (B) linkage disequilibrium for synonymous (s = 0) and nonsynonymous (2Ns = -1) polymorphisms. In simulations with scaled population size, mutation rate μ = 5e-7 and N is scaled to achieve 4Nμ equal to 0.002, 0.01 and 0.05. In simulations with scaled mutation rate, N = 1000 and μ is scaled accordingly. Simulations are performed for 100N generations. Filled areas show 95% confidence intervals calculated for 50 simulations with 4Nμ = 0.05; 250 simulations with 4Nμ = 0.01 and 1000 simulations with 4Nμ = 0.002.
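The two scaling schemes compared in Figure R3 can be written down explicitly. The sketch below uses hypothetical helper names. It fixes either N = 1000 or μ = 5e-7 (the values in the caption), solves 4Nμ = θ for the other parameter, and rescales s to keep 2Ns constant, here at 2Ns = -1. The two schemes coincide at 4Nμ = 0.002, since 4 × 1000 × 5e-7 = 0.002.

```python
def scale_mutation(theta, n=1000):
    """Keep N fixed and set mu so that the population-scaled rate 4*N*mu equals theta."""
    return n, theta / (4 * n)


def scale_population(theta, mu=5e-7):
    """Keep mu fixed and set N so that 4*N*mu equals theta."""
    return round(theta / (4 * mu)), mu


def selection_for(two_ns, n):
    """Selection coefficient s that keeps the population-scaled strength 2*N*s fixed."""
    return two_ns / (2 * n)


if __name__ == "__main__":
    for theta in (0.002, 0.01, 0.05):
        n_fix, mu_scaled = scale_mutation(theta)
        n_scaled, mu_fix = scale_population(theta)
        print(f"4Nmu={theta}: mu-scaled N={n_fix}, mu={mu_scaled:.2e}, "
              f"s={selection_for(-1, n_fix):.1e} | N-scaled N={n_scaled}, "
              f"mu={mu_fix:.0e}, s={selection_for(-1, n_scaled):.1e}, "
              f"generations={100 * n_scaled}")
```

For 4Nμ = 0.05, population-size scaling requires N = 25,000 and 2.5 million generations (100N). Mutation-rate scaling instead uses N = 1000 with μ = 1.25e-5, which illustrates the computational cost mentioned above.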

      A selection coefficient of -0.01 also implies 2Ns = -20, whereas Hill-Robertson interference is most pronounced between mutations with 2Ns ~ -1.

      We performed additional simulations of evolution in a highly polymorphic population (4Nμ = 0.2) with nonsynonymous mutations under a selection coefficient of -5e-4 (2Ns = -1) and varying recombination rates. Previous studies showed that Hill-Robertson interference results in repulsion of deleterious variants (Hill and Robertson, 1966; Comeron et al., 2008; Garcia and Lohmueller, 2021). Consistent with these studies, LDnonsyn in our simulations is lower than LDsyn for all recombination rates (Appendix 2 - figure 4). We have added these results to Appendix 2.

      3) Large portions of the genome (8.4 and 15.9%, depending on the population) are covered by haploblocks, which are originally detected as genomic windows with elevated LD among SNPs. It's therefore unsurprising that haploblocks identified as high-LD outliers have elevated LD compared to other regions of the genome, and the discussion about the importance of haploblocks seemed a bit circular.

      Haploblocks are surprising in two ways. Firstly, the existence of haploblocks by itself is indicative of balancing selection allowing two divergent haplotypes to persist within the population for a long time. Secondly, the strongest excess of LDnonsyn over LDsyn is observed in genes with high LD, i.e. the ones partially or fully falling within haploblock regions (Figure 3). The positive correlation between LD and the excess of LDnonsyn indicates that epistasis is more efficient in regions of high LD (haploblocks). The strong attraction between nonsynonymous variants observed in S. commune therefore results from an interaction between epistasis and balancing selection. We have now reformulated the corresponding Results section to make this clearer. We also discuss the interaction between balancing selection and epistasis in the Discussion section of the manuscript.

      4) Finally, the authors observe a positive correlation between Pn/Ps and LD between both synonymous and nonsynonymous mutations. This result is intriguing and should be discussed, but the authors do not comment on this result in the Discussion.

      The positive correlation between pn/ps, LD and the excess of LDnonsyn can be caused by multiple mechanisms. One is positive epistasis weakening the action of negative selection on nonsynonymous variants. Another is differences in the efficacy of epistatic and non-epistatic selection for alleles at different allele frequencies or local recombination rates. We have now added a discussion of the interaction between pn/ps, LD and the excess of LDnonsyn to the corresponding Results section.

    1. Author Response

      Reviewer #1 (Public Review):

      A clear strength of the present manuscript is its scientific rigor. The authors put a lot of emphasis on transparent reporting and pre-registered their hypotheses. The within-person experimental design is well constructed and deals upfront with several potential confounds. All in all, the experimental design allowed a replication and extension of findings related to evoked neural responses due to auditory presentation during sleep. Nevertheless, the exact neural mechanisms that should drive sleep-dependent learning gains due to reactivation remain elusive. In part this is due to analytical choices, especially with regard to the phase-amplitude coupling analyses. For example, it remains to be established that there is a reliable coupling of SOs and SPs before any condition-specific analyses appear appropriate.

      We thank the reviewer for these constructive remarks. We acknowledge that the description of the phase-amplitude coupling analyses lacked details in the initial submission and we therefore clarified the approach in the revised manuscript. Moreover, we followed the suggestion of the reviewer and performed additional analyses to test for coupling within each stimulation condition and at rest separately. Briefly, the results show a reliable coupling between the phase of the slow oscillations and the amplitude of the signal in the sigma band irrespective of the stimulation condition. These results are reported in Supplemental Figure S5 of the revised submission.

      Reviewer #2 (Public Review):

      The work by Nicolas et al. investigates neurophysiological processes in response to sound cues delivered during sleep. Importantly, the presented sound cues were previously associated with a motor sequence participants had to practice. By presenting the sound cues during sleep, performance in pressing the motor sequence was increased (targeted memory reactivation, TMR). At the neural level, presenting sound cues associated with a motor sequence resulted in a higher amplitude (of the evoked response as well as of spontaneous slow waves) than presenting sound cues without any association. Further, the precise interplay between slow and sigma oscillations correlated with the behavioural TMR benefit.

      This finding is of high interest. However, some aspects of the analyses have to be clarified and the interpretation of sigma oscillations protecting motor memory (by being nested in the trough of the slow oscillation peak) has to be more substantiated by further results.

      Strengths: The study is elegantly designed (within-subjects design) and allows for testing the proposed hypotheses. The study as a sleep study is well controlled for example by incorporating a habituation nap, by using actigraphy during three nights before the learning nap and by measuring vigilance objectively as well as subjectively.

      One of the biggest strengths of the study is its pre-registration. The authors did not just pre-register the study but also highlight and justify any deviation from the pre-registration and state whether an analysis was planned or exploratory. Thus, the whole research process is very transparent and plausible.

      We thank the reviewer for these constructive and positive remarks. We acknowledge that some aspects of the analyses lacked details in the initial submission and we therefore clarified the approach in the revised manuscript. Additionally, we have thoroughly considered the reviewer’s suggestions with respect to the analyses and interpretation of the sigma oscillations data (see response to comment #2 below).

      Weaknesses: The interpretation of sigma oscillations protecting motor memories (i.e., sigma power towards unassociated sound cues is increased in the trough of an evoked potential) is not very well substantiated by the results.

      We thank the reviewer for giving us the opportunity to further examine the role of sigma oscillations (and their coupling with slow oscillations) in the protective processes discussed in the manuscript. Our results indeed suggest that when a control, unknown cue is presented to the sleeping brain, it might trigger protective mechanisms to prevent these “irrelevant” sensory stimuli to be processed and therefore disturb the ongoing consolidation process. Specifically, we speculated that SW-sigma coupling during exposure to unassociated sounds might prevent sound processing which would in turn be reflected by a decrease in the amplitude of the slow electrophysiological responses (i.e., smaller ERP and SWs) during non-associated sound intervals. In order to further examine this possibility, we performed exploratory analyses testing for potential relationships between the eventrelated phase-amplitude coupling (ERPAC) observed on unassociated conditions and slow electrophysiological responses (i.e., ERP and SWs). To do so, we extracted the ERPAC value during unassociated stimulation intervals in the time-frequency window where ERPAC was significantly greater for unassociated as compared to associated and rest conditions (i.e. from -0.5 to 0.5 sec and from 14 to 18 Hz, see Figure 6 in the main text). While the ERPAC during unassociated intervals did not correlate with the amplitude of the unassociated ERPs, it correlated negatively with the properties of the SWs detected during unassociated intervals. Specifically, the higher the ERPAC, the lower SW density (t = 2.9, df = 20, p-value = 0.004) and peak-to-peak amplitude (S = 2460, p-value = 0.037) during unassociated intervals. These analyses, albeit exploratory, provide further support to the protective mechanism discussed in the initial version of the manuscript. 
These results are now reported in the supplemental information (Supplemental Figure S9) and mentioned in the revised discussion to further substantiate the hypothesized protective mechanism (see p. 13, l. 46 of the revised manuscript).
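
      As an illustration of the exploratory analysis described above, the Python sketch below averages each participant's ERPAC inside the reported window (-0.5 to 0.5 s, 14-18 Hz) and correlates it with a per-participant SW measure. It assumes an ERPAC array shaped (participants x frequencies x times); the function names are hypothetical and this is not the code used for the manuscript. The S statistic follows the form R's `cor.test` reports for Spearman's test (sum of squared rank differences), which is an assumption about how the S value above was obtained; ties are not handled.

```python
import numpy as np


def window_mean(erpac, times, freqs, t_win=(-0.5, 0.5), f_win=(14.0, 18.0)):
    """Average ERPAC (participants x freqs x times) inside a time-frequency window."""
    t_mask = (times >= t_win[0]) & (times <= t_win[1])
    f_mask = (freqs >= f_win[0]) & (freqs <= f_win[1])
    # Keep the selected frequency rows, then the selected time columns.
    return erpac[:, f_mask, :][:, :, t_mask].mean(axis=(1, 2))


def pearson_t(x, y):
    """Pearson r with its t statistic and degrees of freedom (n - 2)."""
    r = np.corrcoef(x, y)[0, 1]
    df = len(x) - 2
    return r, r * np.sqrt(df / (1.0 - r**2)), df


def spearman_s(x, y):
    """Spearman rho and S = sum of squared rank differences (assumes no ties)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    n = len(x)
    s = float(((rx - ry) ** 2).sum())
    return 1.0 - 6.0 * s / (n * (n**2 - 1)), s
```

      For example, `pearson_t(window_mean(erpac_unassoc, times, freqs), sw_density)` would return r, t and df for hypothetical arrays `erpac_unassoc` and `sw_density`.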

      The motivation for some analysis decisions is not always clear. To highlight one example, it is unclear why the authors average the data across channels. Previous findings demonstrate that slow oscillations and sleep spindles vary across the scalp (Klinzing et al. (2016), Cox et al. (2017)). Thus, averaging across all channels potentially introduces more noise.

      We apologize for the lack of justification concerning the averaging procedures in the original manuscript. We now explain in the revised manuscript the motivation for averaging data across channels in our different analyses (see pages 21 and 23). Briefly, as our montage did not allow fine topographical analyses (only 6 EEG channels), we opted to average data across channels in order to decrease the dimensionality of the data. However, we agree with the reviewer that reporting channel-level data is important. Therefore, for each analysis presented in the main text, the corresponding channel-level results are reported in the supplements (i.e., ERPs are shown in Supplemental Figures S2 and S4, the correlation between the targeted memory reactivation index and power modulation is depicted in Supplemental Figure S7, the PAC difference at the negative peak of the SW is shown in Supplemental Figure S6, and the PAC/TMR index correlation in Supplemental Figure S8). Altogether, channel-level data revealed that central electrodes (and, to a lesser extent, frontal electrodes) mainly contributed to the pattern of results obtained with the averaged data reported in the main text.

      The description of some methods has to be more precise (for example the detection of slow waves and sleep spindles and specifically the phase coupling).

      We thank the reviewer for pointing that out. We have now revised the manuscript to provide the necessary details on the detection algorithms (Vallat & Walker, 2021) as well as on the event-related phase-amplitude coupling method (Voytek et al., 2013, Combrisson et al., 2020). We invite the reviewer to consult the responses to comments #13 and #16 below for detailed responses to these points.
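
      For readers unfamiliar with event-related phase-amplitude coupling, the core computation, as we understand it from Voytek et al. (2013), is a circular-linear correlation computed across trials, separately at each time point, between the low-frequency phase and the high-frequency amplitude. The Python sketch below assumes that phase and amplitude (for example from Hilbert transforms of band-passed EEG) have already been extracted into (trials x times) arrays; it is a minimal illustration with hypothetical names, not the Combrisson et al. (2020) implementation.

```python
import numpy as np


def _corr(a, b):
    """Pearson correlation along axis 0 (trials), one value per column (time point)."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    return (a * b).sum(axis=0) / np.sqrt((a**2).sum(axis=0) * (b**2).sum(axis=0))


def erpac(phase, amp):
    """Circular-linear correlation between phase (radians) and amplitude.

    phase, amp: (n_trials, n_times). Returns (n_times,) values in [0, 1].
    """
    sin_p, cos_p = np.sin(phase), np.cos(phase)
    r_as = _corr(amp, sin_p)
    r_ac = _corr(amp, cos_p)
    r_sc = _corr(sin_p, cos_p)
    return np.sqrt((r_ac**2 + r_as**2 - 2.0 * r_ac * r_as * r_sc) / (1.0 - r_sc**2))
```

      Because the correlation is taken across trials rather than over time, the coupling can change from one time point to the next, which is what makes the measure event-related.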

      Reviewer #3 (Public Review):

      Nicolas et al. performed a nap study in healthy humans to examine the temporal dynamics of sleep oscillations during procedural memory consolidation. To this end, the authors used targeted memory reactivation (TMR) to re-expose participants during a nap to a sound cue previously associated with a finger-tapping sequence. The control conditions were (i) a second encoded sequence whose sound was not played during sleep, (ii) a novel control cue not heard during prior wakefulness, and (iii) so-called rest periods during which no cueing was performed. Behaviorally, the authors confirm the beneficial effect of TMR, as participants perform better (faster) on the reactivated sequence than on the non-reactivated sequence after their nap and even after an additional night spent at home.

      Electroencephalography recordings acquired during the nap then revealed that TMR cues evoked stronger responses than control cues, hinting at distinct processing of familiar and memory-related cues. This is supported by a general analysis of 0.5 to 2 Hz slow waves, one fundamental sleep oscillation linked to memory consolidation, which showed higher densities during intervals of real cueing. Interestingly, the density of 12-16 Hz sleep spindles was not influenced; however, their frequency decreased and their amplitude increased. Finally, the authors assessed the coupling between slow waves and sleep spindles, which, rather counter-intuitively, showed increased coupling during intervals cued with control sounds. Moreover, the stronger this coupling, the higher the TMR benefit.

      Altogether, these data reveal an interesting slow wave-spindle dynamic underlying the processing of familiar and unfamiliar auditory cues and scrutinize how these brain rhythms mediate memory consolidation.

      Overall, this is a very well-designed experiment, and I commend that it was pre-registered and how transparently everything has been reported. Moreover, control sounds during sleep are currently rarely taken advantage of in TMR studies, although they can add important insights. While the analysis pipeline is appropriate and well-rounded, some aspects need to be clarified and extended.

      We would like to thank the reviewer for the time devoted to our manuscript and for the constructive comments about our work. We provide below detailed answers to the points raised by the reviewer.

      Response to control sounds. It is very surprising that the response to control sounds is, apart from an early evoked component around 100 ms, almost nonexistent. Auditory stimuli are generally known to evoke K-complexes and strong spindle responses. Could it be that, for some reason, control sounds were lower in volume or led to stronger habituation? A control analysis might help to rule out such a confound. For example, ERPs at the beginning and end of each stimulation interval could be contrasted. Moreover, the authors state that sound cues were balanced across subjects. However, they also state that the volume was adapted for each sound individually. Additional data or statistics on these volumes, the randomization, and the cued slow-wave phase would be very helpful.

      We thank the reviewer for raising this point and for giving us the opportunity to elaborate on these aspects. The sound volume was indeed adjusted based on the perception level of each sound for each individual. As pointed out by the reviewer, this resulted in different absolute volumes for each sound and individual; however, all sounds were presented at the same percentage of the detection threshold across participants. Moreover, as the sound-condition associations were perfectly balanced in our experiment (each sound was associated with each condition 8 times), differences in sound volume (or frequency) cannot explain our pattern of results.

      Further, inspection of the ERPs at the individual channel level (cf. Supplemental Figure S2) revealed that unassociated auditory cues can indeed elicit a negative peak on some channels (Fz and, to a lesser extent, C3). We invite the reviewer to refer to our response to comment #12 of reviewer #2 for a comparison with the relevant literature.

      To address the reviewer's comment on potential habituation effects, we performed exploratory analyses on a subset of events. Specifically, we compared the ERPs computed across the first 30 vs. the last 30 cues presented during the nap within each condition (see Figure 1 below). CBP did not reveal any difference between early and late nap ERPs in any condition (all p-values > 0.2). Importantly, the results observed within the unassociated condition are similar to those reported in the main text across all trials. Altogether, these analyses suggest that the weaker responses to the unassociated sound are not due to habituation processes.

      Figure 1: Event-related potentials, early vs. late nap. Group average (and standard error) of potentials evoked by the first 30 (grey) and the last 30 (black) auditory cues of the nap, from cue onset to 2.5 s post-cue (left: associated cues; right: unassociated cues). CBP did not show any early vs. late differences in ERPs in any condition.
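
      The early-versus-late comparison can be sketched as follows: build each participant's ERP from the first 30 and the last 30 cues of a condition, then compare the two with a paired, sign-flip cluster-based permutation test over time points. This Python sketch is a simplified one-dimensional version (cluster mass = summed t values, positive and negative clusters formed separately, cluster-forming threshold t = 2.0); the threshold, number of permutations, and function names are illustrative assumptions, not the parameters of the CBP used in the manuscript.

```python
import numpy as np


def early_late_erps(epochs, n=30):
    """epochs: (n_cues, n_times) for one participant and condition, in presentation order."""
    return epochs[:n].mean(axis=0), epochs[-n:].mean(axis=0)


def _runs(mask):
    """(start, stop) index pairs of contiguous True runs in a 1-D boolean array."""
    edges = np.diff(np.concatenate(([0], mask.astype(int), [0])))
    return list(zip(np.flatnonzero(edges == 1), np.flatnonzero(edges == -1)))


def _paired_t(d):
    return d.mean(axis=0) / (d.std(axis=0, ddof=1) / np.sqrt(d.shape[0]))


def _clusters(t, thresh):
    """Positive and negative supra-threshold clusters with their absolute t-mass."""
    out = []
    for sign in (1.0, -1.0):
        for start, stop in _runs(sign * t > thresh):
            out.append((start, stop, abs(t[start:stop].sum())))
    return out


def paired_cbp(a, b, thresh=2.0, n_perm=1000, seed=0):
    """a, b: (n_participants, n_times). Returns [(start, stop, p_value), ...]."""
    rng = np.random.default_rng(seed)
    d = a - b
    observed = _clusters(_paired_t(d), thresh)
    null_max = np.zeros(n_perm)
    for i in range(n_perm):
        # Randomly swap condition labels within participants (sign flip).
        flips = rng.choice([-1.0, 1.0], size=(d.shape[0], 1))
        masses = [m for _, _, m in _clusters(_paired_t(d * flips), thresh)]
        null_max[i] = max(masses, default=0.0)
    return [(s, e, (np.sum(null_max >= m) + 1) / (n_perm + 1)) for s, e, m in observed]
```

      Comparing each observed cluster with the permutation distribution of the largest cluster controls the family-wise error across time points without testing each sample separately.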

      Last, with respect to the point on cued slow-wave phase, we extracted the phase of the slow oscillation (0.5-2 Hz) at which the auditory cues were sent in each condition separately (see Figure 2 below). We then tested whether the phases differed using the Watson-Williams multi-sample test for equal means (Berens, 2009). Results showed no difference between the two conditions (F(1,46) = 0.6, p-value = 0.8), suggesting that the effects reported in the main text were not confounded by this factor.

      Figure 2: Phase of slow oscillation at stimulation. Phase in degrees of the SO at the associated (magenta) or unassociated (yellow) auditory cues.
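
      The phase extraction can be sketched as follows: band-pass the EEG to the slow-oscillation range (0.5-2 Hz), take the angle of the analytic signal, and read it out at the cue-onset samples. To stay dependency-free, this Python sketch uses an FFT brick-wall filter and an FFT-based analytic signal (the same construction `scipy.signal.hilbert` uses); the brick-wall filter is a simplifying assumption, a real pipeline would use a proper FIR/IIR filter, and the Watson-Williams test itself (Berens, 2009) is not reimplemented here. A circular mean is included because averaging angles arithmetically fails near ±180°.

```python
import numpy as np


def fft_bandpass(x, fs, lo=0.5, hi=2.0):
    """Zero-phase brick-wall band-pass via the FFT (a crude stand-in for a proper filter)."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    spec[(freqs < lo) | (freqs > hi)] = 0.0
    return np.fft.irfft(spec, n=x.size)


def analytic_signal(x):
    """FFT-based analytic signal (x + i * Hilbert(x))."""
    n = x.size
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * h)


def phase_at_cues(eeg, fs, cue_samples, lo=0.5, hi=2.0):
    """Slow-oscillation phase (degrees, 0 = positive peak) at each cue-onset sample."""
    phase = np.angle(analytic_signal(fft_bandpass(np.asarray(eeg, dtype=float), fs, lo, hi)))
    return np.degrees(phase[np.asarray(cue_samples)])


def circ_mean_deg(angles_deg):
    """Circular mean of angles in degrees."""
    return np.degrees(np.angle(np.mean(np.exp(1j * np.radians(angles_deg)))))
```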

      Discrete slow wave analysis. It is reported that the offline detection of slow waves yielded identical numbers across conditions, but this contradicts the later reported differences in densities. If this is true, it implies that the total time during which real cues and control cues were presented, as well as the time during which cueing paused (i.e., the rest intervals), differs within subjects. It needs to be ensured that effective stimulation times are comparable between subjects and are not confounded by unfair comparisons.

      There might be a misunderstanding on this point, as we did not compare the number of SWs between conditions but only SW density and amplitude. We assume that the reviewer is referring to the number of auditory cues sent during NREM that were indeed not different across conditions.

      Statistical results. Consistently across all cluster-based statistics, significant clusters do not seem to reflect the underlying colormaps. One would expect significance to be driven by clusters of greatest difference (Figure 6B and C). That something might be amiss is reflected in the statement that a contrast of TFRs for real and control cues revealed no significant cluster, although this contrast, as shown in Figure 7a, clearly depicts two clusters with strong power differences (before 500 ms around 8 Hz, and after 500 ms around 20 Hz).

      Moreover, follow-up analyses revolving around sleep spindles are based on inconsistent frequency ranges. For one analysis, a prior significant cluster is used (Figure 8), while for the other it is limited to 12-16 Hz and a much shorter time window than the overall cluster (Figure 7), even within the pre-registered 12-16 Hz window. Overall, these analyses should be checked and streamlined.

      We agree with the reviewer that time-frequency representations (TFRs) of results can be misleading, as inter-subject variability is not represented. As such, clusters showing, e.g., a large difference in PAC between conditions but also high inter-subject variability would be represented with warm colors in the TFR but would not be highlighted by the CBP statistics (as seen, for example, in Figure 6B and C). Instead, CBP highlights effects that are consistent across participants, and these effects can indeed be of lower amplitude in some cases.
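
      A small numerical illustration of this point, using hypothetical numbers rather than data from the study: the group-mean difference (what a colour map of the raw contrast shows) and the paired t statistic (what a cluster-based test thresholds) can rank time-frequency bins very differently when inter-subject variability differs between bins.

```python
import numpy as np


def mean_and_t(diff):
    """diff: (n_participants, n_bins) condition differences. Returns group mean and paired t."""
    mean = diff.mean(axis=0)
    t = mean / (diff.std(axis=0, ddof=1) / np.sqrt(diff.shape[0]))
    return mean, t


# Hypothetical values for two bins across five participants:
# bin 0 has a large but inconsistent difference, bin 1 a small but consistent one.
diff = np.array([[4.0, 0.9],
                 [-2.0, 1.1],
                 [5.0, 1.0],
                 [-1.0, 0.9],
                 [4.0, 1.1]])
mean, t = mean_and_t(diff)
print(mean)  # group means: 2.0 and 1.0
print(t)     # t values: about 1.38 and 22.4
```

      Bin 0 would look "warmer" in a map of mean differences, yet only bin 1 carries a consistent effect.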

      Concerning Figure 7, the initial time-frequency plot presented the power difference between conditions that was subsequently correlated with the TMR index while the statistical cluster showed the results of the correlation. As this was indeed confusing (see also our response to comment #10 below and to comments #26 and #27 of reviewer #2), we now show the rho values issued from the correlation between the power difference and the TMR index. We thank the reviewer for pointing this out, as the new representation improved the readability of the figure.
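
      The revised representation (rho values rather than raw power differences) can be sketched as a Spearman correlation computed across participants, independently at every time-frequency bin, between the power difference and the TMR index. This Python sketch assumes arrays shaped (participants x frequencies x times) and no tied values; names are hypothetical, it is not the manuscript's code, and it omits the cluster-based correction.

```python
import numpy as np


def ranks(x, axis=0):
    """Ranks along `axis` (0-based); ties are not averaged, so assume none."""
    return np.argsort(np.argsort(x, axis=axis), axis=axis).astype(float)


def spearman_map(power_diff, tmr_index):
    """power_diff: (n_participants, n_freqs, n_times); tmr_index: (n_participants,).

    Returns an (n_freqs, n_times) map of Spearman rho (Pearson r on ranks).
    """
    rp = ranks(power_diff, axis=0)
    rt = ranks(tmr_index)
    rp -= rp.mean(axis=0)
    rt -= rt.mean()
    num = np.tensordot(rt, rp, axes=(0, 0))
    den = np.sqrt((rt**2).sum() * (rp**2).sum(axis=0))
    return num / den
```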

      Last, we want to thank the reviewer for pointing out the discrepancy regarding the procedure used to extract the data for the scatter plots shown in panel B of Figures 7 and 8 (referred to as “follow-up analyses” by the reviewer). We now extract the values in the significant clusters included in the preregistered frequency band (12-16 Hz) for both analyses presented in Figures 7 and 8. It is worth noting, though, that this procedure was used only for illustration purposes and was therefore not a formal follow-up analysis. We acknowledge that the p-values displayed on the panel B plots of the original figures might be misleading in that regard; thus, they were removed in the revised manuscript.
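
      The extraction used for the panel B scatter plots can be sketched as the intersection of two masks (the significant cluster and the pre-registered 12-16 Hz band), averaged per participant. The Python sketch assumes a (participants x frequencies x times) array and a boolean cluster mask; the function name is hypothetical.

```python
import numpy as np


def cluster_band_values(tfr, cluster_mask, freqs, band=(12.0, 16.0)):
    """Per-participant mean over bins that are in the cluster AND inside the band.

    tfr: (n_participants, n_freqs, n_times); cluster_mask: (n_freqs, n_times) bool.
    """
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    mask = cluster_mask & in_band[:, None]
    if not mask.any():
        raise ValueError("the cluster does not overlap the frequency band")
    return tfr[:, mask].mean(axis=1)
```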

    1. Author Response

      Reviewer #1 (Public Review):

      This manuscript will interest cognitive scientists, neuroimaging researchers, and neuroscientists interested in the systems-level organization of brain activity. The authors describe four brain states that are present across a wide range of cognitive tasks and determine that the relative distribution of the brain states shows both commonalities and differences across task conditions.

      The authors characterized the low-dimensional latent space that has been shown to capture the major features of intrinsic brain activity using four states obtained with a Hidden Markov Model. They related the four states to previously described functional gradients in the brain and examined the relative contribution of each state under different cognitive conditions. They showed that the states related to the measured behavior differed across conditions, but that a common state appears to reflect disengagement across conditions. The authors bring together a state-of-the-art analysis of systems-level brain dynamics and cognitive neuroscience, bridging a gap that has long needed to be bridged.

      The strongest aspect of the study is its rigor. The authors use appropriate null models and examine multiple datasets (not used in the original analysis) to demonstrate that their findings replicate. Their thorough analysis convincingly supports their assertion that common states are present across a variety of conditions, but that different states may predict behavioural measures for different conditions. However, the authors could have better situated their work within the existing literature. It is not that a more exhaustive literature review is needed; rather, some of their results are unsurprising given the work reported in other manuscripts, some of their work reinforces or is reinforced by prior studies, and some of their work is not compared to similar findings obtained with other analysis approaches. While space is not unlimited, some of these gaps are important enough that they are worth addressing:

      We appreciate the reviewer’s thorough read of our manuscript and positive comments on its rigor and implications. We agree that the original version of the manuscript insufficiently situated this work in the existing literature. We have made extensive revisions to better place our findings in the context of prior work. These changes are described in detail below.

      1) The authors' own prior work on functional connectivity signatures of attention is not discussed in comparison to the latest work. Neither is work from other groups showing signatures of arousal that change over time, particularly in resting state scans. Attention and arousal are not the same things, but they are intertwined, and both have been linked to large-scale changes in brain activity that should be captured in the HMM latent states. The authors should discuss how the current work fits with existing studies.

      Thank you for raising this point. We agree that the relationship between low-dimensional latent states and predefined activity and functional connectivity signatures is an important and interesting question in both attention research and more general contexts. Here, we did not empirically relate the brain states examined in this study to the functional connectivity signatures previously investigated in our lab (e.g., Rosenberg et al., 2016; Song et al., 2021a) because the research question and its methodological complexities deserve separate attention that goes beyond the scope of this paper. Instead, we conceptually addressed the reviewer’s question of how functional connectivity signatures of attention relate to the brain states observed here. Next, we asked how arousal relates to the brain states by indirectly predicting the arousal level of each brain state from the spatial resemblance of its activity pattern to a predefined arousal network template (Goodale et al., 2021).

      Latent states and dynamic functional connectivity

      Previous work suggested that, on medium time scales (~20-60 seconds), changes in functional connectivity signatures of sustained attention (Rosenberg et al., 2020) and narrative engagement (Song et al., 2021a) predicted changes in attentional states. How do these attention-related functional connectivity dynamics relate to latent state dynamics, measured on a shorter time scale (1 second)?

      Theoretically, there are reasons to think that these measures are related but not redundant. Both the HMM and dynamic functional connectivity provide summary measures of whole-brain functional interactions that evolve over time. Whereas the HMM identifies recurring low-dimensional brain states, the dynamic functional connectivity used in our and others’ prior studies captures high-dimensional dynamical patterns. Furthermore, while the mixture Gaussian function utilized to infer emission probabilities in our HMM infers the states from both the BOLD activity patterns and their interactions, functional connectivity considers only pairwise interactions between regions of interest. Thus, on the theoretical ground that brain states can be characterized at multiple scales and with different methods (Greene et al., 2023), we can hypothesize that both measures could (and perhaps should be able to) capture brain-wide latent state changes. For example, if we were to apply k-means clustering to sliding-window dynamic functional connectivity as in Allen et al. (2014), the resulting clusters could arguably be similar to the latent states derived from the HMM.
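
      To make the comparison concrete, a sliding-window dynamic functional connectivity analysis followed by k-means clustering, broadly in the spirit of Allen et al. (2014), can be sketched in Python as below. The sketch is a simplification under stated assumptions: rectangular windows, Pearson correlations, Euclidean k-means with a plain Lloyd iteration, and hypothetical function names. Allen et al. made different choices (e.g., tapered windows), and this is not code from either study.

```python
import numpy as np


def sliding_window_fc(ts, win, step=1):
    """ts: (n_timepoints, n_regions). Returns (n_windows, n_edges) upper-triangle correlations."""
    n_t, n_r = ts.shape
    iu = np.triu_indices(n_r, k=1)
    return np.array([np.corrcoef(ts[s:s + win].T)[iu]
                     for s in range(0, n_t - win + 1, step)])


def kmeans(x, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm with Euclidean distance."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        new = np.array([x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                        for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```

      Each window then receives one cluster label, which is the unit that would be compared against the per-TR HMM state sequence.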

      However, there are practical reasons why the correspondence between our prior dynamic functional connectivity models and current HMM states is difficult to test directly. A time point-by-time point matching of the HMM state sequence and dynamic functional connectivity is not feasible because, in our prior work, dynamic functional connectivity was measured in a sliding time window (~20-60 seconds), whereas the HMM state identification is conducted at every TR (1 second). An alternative would be to concatenate all time points that were categorized as each HMM state to compute representative functional connectivity of that state. This “splicing and concatenating” method, however, disrupts continuous BOLD-signal time series and has not previously been validated for use with our dynamic connectome-based predictive models. In addition, the difference in time series lengths across states would make comparisons of the four states’ functional connectomes unfair.
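
      The “splicing and concatenating” alternative can be sketched as follows: gather every TR assigned to a given HMM state, compute one correlation matrix from those frames, and keep track of how many frames each matrix rests on. This Python sketch (hypothetical function name, (TRs x regions) input assumed) makes the unequal-length caveat explicit by returning the frame counts; it does nothing about the loss of temporal continuity.

```python
import numpy as np


def state_fc(ts, states, n_states):
    """ts: (n_trs, n_regions); states: (n_trs,) integer HMM state labels.

    Returns one FC matrix per state (NaN if too few frames) and the frame counts.
    """
    n_regions = ts.shape[1]
    fcs, counts = [], []
    for s in range(n_states):
        frames = ts[states == s]  # non-contiguous TRs spliced together
        counts.append(len(frames))
        if len(frames) > 2:
            fcs.append(np.corrcoef(frames.T))
        else:
            fcs.append(np.full((n_regions, n_regions), np.nan))
    return fcs, counts
```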

      One main focus of our manuscript was to relate brain dynamics (HMM state dynamics) to static manifold (functional connectivity gradients). We agree that a direct link between two measures of brain dynamics, HMM and dynamic functional connectivity, is an important research question. However, due to some intricacies that needed to be addressed to answer this question, we felt that it was beyond the scope of our paper. We are eager, however, to explore these comparisons in future work which can more thoroughly address the caveats associated with comparing models of sustained attention, narrative engagement, and arousal defined using different input features and methods.

      Arousal, attention, and latent neural state dynamics

      Next, the reviewer posed an important question about the relationship between arousal, attention, and latent states. The current study was designed to assess the relationship between attention and latent state dynamics. However, previous neuroimaging work showed that low-dimensional brain dynamics reflect fluctuations in arousal (Raut et al., 2021; Shine et al., 2016; Zhang et al., 2023). Behavioral studies showed that attention and arousal hold a non-linear relationship, for example, mind-wandering states are associated with lower arousal and externally distracted states are associated with higher arousal, when both these states indicate low attention (Esterman and Rothlein, 2019; Unsworth and Robison, 2018, 2016).

      To address the reviewer’s suggestion, we wanted to test whether our brain states reflected changes in arousal, but we did not collect relevant behavioral or physiological measures. Therefore, to test for relationships indirectly, we predicted levels of arousal in brain states by applying the “arousal network template” defined by Dr. Catie Chang’s group (Chang et al., 2016; Falahpour et al., 2018; Goodale et al., 2021). The arousal network template was created from resting-state fMRI data to predict arousal levels indicated by eye monitoring and electrophysiological signals. In the original study, the arousal level at each time point was predicted by the correlation between the BOLD activity pattern of each TR and the arousal template. The more similar the whole-brain activation pattern was to the arousal network template, the more aroused the participant was predicted to be at that moment. This activity pattern-based model was generalized to fMRI data during tasks (Goodale et al., 2021).

      We correlated the arousal template to the activity patterns of the four brain states that were inferred by the HMM. The DMN state was positively correlated with the arousal template (r=0.264) and the SM state was negatively correlated with the arousal template (r=-0.303) (Author response image 1). These values were not tested for significance because they were single observations. While speculative, this may suggest that participants are in a high arousal state during the DMN state and a low arousal state during the SM state. Together with our results relating brain states to attention, it is possible that the SM state is a common state indicating low arousal and low attention. On the other hand, the DMN state, a signature of a highly aroused state, may benefit gradCPT task performance but not necessarily in engaging with a sitcom episode. However, because this was a single observation and we did not collect a physiological measure of arousal to validate this indirect prediction result, we did not include the result in the manuscript. We hope to more directly test this question in future work with behavioral and physiological measures of arousal.

      Author response image 1.
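
      The template-matching step can be sketched as a spatial Pearson correlation between an activity pattern (one value per region) and the arousal template. Applied to each TR's pattern, it gives a time course of predicted arousal, as in the original studies. Applied to each HMM state's mean activity pattern, it gives one value per state, as in the analysis above. The Python sketch assumes that patterns and template share the same parcellation; the function name is hypothetical.

```python
import numpy as np


def pattern_corr(patterns, template):
    """Spatial Pearson r between each row of `patterns` (n, n_regions) and `template` (n_regions,)."""
    patterns = np.atleast_2d(np.asarray(patterns, dtype=float))
    p = patterns - patterns.mean(axis=1, keepdims=True)
    t = np.asarray(template, dtype=float)
    t = t - t.mean()
    return (p @ t) / (np.linalg.norm(p, axis=1) * np.linalg.norm(t))
```

      With hypothetical arrays, `pattern_corr(bold, template)` (bold shaped TRs x regions) would give a per-TR arousal index, and `pattern_corr(state_means, template)` (one row per state) one value per state.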

      Changes made to the manuscript

      Importantly, we agree with the reviewer that a theoretical discussion about the relationships between functional connectivity, latent states, gradients, as well as attention and arousal was a critical omission from the original Discussion. We edited the Discussion to highlight past literature on these topics and encourage future work to investigate these relationships.

      [Manuscript, page 11] “Previous studies showed that large-scale neural dynamics that evolve over tens of seconds capture meaningful variance in arousal (Raut et al., 2021; Zhang et al., 2023) and attentional states (Rosenberg et al., 2020; Yamashita et al., 2021). We asked whether latent neural state dynamics reflect ongoing changes in attention in both task and naturalistic contexts.”

      [Manuscript, page 17] “Previous work showed that time-resolved whole-brain functional connectivity (i.e., paired interactions of more than a hundred parcels) predicts changes in attention during task performance (Rosenberg et al., 2020) as well as movie-watching and story-listening (Song et al., 2021a). Future work could investigate whether functional connectivity and the HMM capture the same underlying “brain states” to bridge the results from the two literatures. Furthermore, though the current study provided evidence of neural state dynamics reflecting attention, the same neural states may, in part, reflect fluctuations in arousal (Chang et al., 2016; Zhang et al., 2023). Complementing behavioral studies that demonstrated a nonlinear relationship between attention and arousal (Esterman and Rothlein, 2019; Unsworth and Robison, 2018, 2016), future studies collecting behavioral and physiological measures of arousal can assess the extent to which attention explains neural state dynamics beyond what can be explained by arousal fluctuations.”

      2) The 'base state' has been described in a number of prior papers (for one early example, see https://pubmed.ncbi.nlm.nih.gov/27008543). The idea that it might serve as a hub or intermediary for other states has been raised in other studies, and discussion of the similarities or differences between those studies and this one would provide better context for the interpretation of the current work. One of the intriguing findings of the current study is that the incidence of this base state increases during sitcom watching, which is the strongest evidence to date that it has a cognitive role and is not merely a configuration of activity that the brain must pass through when making a transition.

      We greatly appreciate the reviewer’s suggestion of prior papers. We were not aware of previous findings of the base state at the time of writing the manuscript, so it was reassuring to see consistent findings. In the Discussion, we highlighted the findings of Chen et al. (2016) and Saggar et al. (2022). Both studies highlighted the role of the base state as a “hub”-like transition state. However, as the reviewer noted, these studies did not address the functional relevance of this state to cognitive states because both were based on resting-state fMRI.

      In our revised Discussion, we write that our work replicates previous findings of the base state that consistently acted as a transitional hub state in macroscopic brain dynamics. We also note that our study expands this line of work by characterizing what functional roles the base state plays in multiple contexts: The base state indicated high attentional engagement and exhibited the highest occurrence proportion as well as longest dwell times during naturalistic movie watching. The base state’s functional involvement was comparatively minor during controlled tasks.

      [Manuscript, page 17-18] “Past resting-state fMRI studies have reported the existence of the base state. Chen et al. (2016) used the HMM to detect a state that had “less apparent activation or deactivation patterns in known networks compared with other states”. This state had the highest occurrence probability among the inferred latent states, was consistently detected by the model, and was most likely to transition to and from other states, all of which mirror our findings here. The authors interpret this state as an “intermediate transient state that appears when the brain is switching between other more reproducible brain states”. The observation of the base state was not confined to studies using HMMs. Saggar et al. (2022) used topological data analysis to represent a low-dimensional manifold of resting-state whole-brain dynamics as a graph, where each node corresponds to brain activity patterns of a cluster of time points. Topologically focal “hub” nodes were represented uniformly by all functional networks, meaning that no characteristic activation above or below the mean was detected, similar to what we observe with the base state. The transition probability from other states to the hub state was the highest, demonstrating its role as a putative transition state.

      However, the functional relevance of the base state to human cognition had not been explored previously. We propose that the base state, a transitional hub (Figure 2B) positioned at the center of the gradient subspace (Figure 1D), functions as a state of natural equilibrium. Transitioning to the DMN, DAN, or SM states reflects incursion away from natural equilibrium (Deco et al., 2017; Gu et al., 2015), as the brain enters a functionally modular state. Notably, the base state indicated high attentional engagement (Figure 5E and F) and exhibited the highest occurrence proportion (Figure 3B) as well as the longest dwell times (Figure 3—figure supplement 1) during naturalistic movie watching, whereas its functional involvement was comparatively minor during controlled tasks. This significant relevance to behavior verifies that the base state cannot simply be a byproduct of the model. We speculate that susceptibility to both external and internal information is maximized in the base state—allowing for roughly equal weighting of both sides so that they can be integrated to form a coherent representation of the world—at the expense of the stability of a certain functional network (Cocchi et al., 2017; Fagerholm et al., 2015). When processing rich narratives, particularly when a person is fully immersed without having to exert cognitive effort, a less modular state with high degrees of freedom to reach other states may be more likely to be involved. The role of the base state should be further investigated in future studies.”
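
      The quantities used above to characterise the base state (occurrence proportion, dwell time, and transitions to and from other states) can all be read off a decoded HMM state sequence. This Python sketch assumes one integer label per TR and reports dwell times in TRs; it is illustrative, uses hypothetical names, and is not the authors' analysis code.

```python
import numpy as np


def state_summary(seq, n_states):
    """seq: (n_trs,) integer state labels from a decoded HMM.

    Returns fractional occupancy, mean dwell time (in TRs), and the
    transition probabilities between distinct states (self-transitions excluded).
    """
    seq = np.asarray(seq)
    occupancy = np.bincount(seq, minlength=n_states) / seq.size

    # Dwell times: lengths of uninterrupted runs of the same state.
    starts = np.concatenate(([0], np.flatnonzero(np.diff(seq)) + 1))
    lengths = np.diff(np.concatenate((starts, [seq.size])))
    run_states = seq[starts]
    dwell = np.array([lengths[run_states == s].mean() if np.any(run_states == s) else 0.0
                      for s in range(n_states)])

    # Count transitions between consecutive TRs, drop self-transitions, row-normalise.
    counts = np.zeros((n_states, n_states))
    np.add.at(counts, (seq[:-1], seq[1:]), 1)
    np.fill_diagonal(counts, 0)
    totals = counts.sum(axis=1, keepdims=True)
    p_switch = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    return occupancy, dwell, p_switch
```

      A hub-like state would show up as a column of `p_switch` with large values (other states tend to switch into it) and a row spread across many targets.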

      3) The link between latent states and functional connectivity gradients should be considered in the context of prior work showing that the spatiotemporal patterns of intrinsic activity that account for most of the structure in resting state fMRI also sweep across functional connectivity gradients (https://pubmed.ncbi.nlm.nih.gov/33549755/). In fact, the spatiotemporal dynamics may give rise to the functional connectivity gradients (https://pubmed.ncbi.nlm.nih.gov/35902649/). HMM states bear a marked resemblance to the high-activity phases of these patterns and are likely to be closely linked to them. The spatiotemporal patterns are typically obtained during rest, but they have been reported during task performance (https://pubmed.ncbi.nlm.nih.gov/30753928/) which further suggests a link to the current work. Similar patterns have been observed in anesthetized animals, which also reinforces the conclusion of the current work that the states are fundamental aspects of the brain's functional organization.

      We appreciate the comments that relate spatiotemporal patterns, functional connectivity gradients, and the latent states derived from the HMM. Our work was also inspired by the papers that the reviewer suggested, especially Bolt et al. (2022), which compared the results of numerous dimensionality reduction and clustering algorithms and suggested three spatiotemporal patterns that seemed to be commonly supported across algorithms. We originally cited these studies throughout the manuscript but did not discuss them comprehensively. We have revised the Discussion to situate our findings in the context of past work that used resting-state fMRI to study low-dimensional latent brain states.

      [Manuscript, page 15-16] “This perspective is supported by previous work that has used different methods to capture recurring low-dimensional states from spontaneous fMRI activity during rest. For example, to extract time-averaged latent states, early resting-state analyses identified task-positive and task-negative networks using seed-based correlation (Fox et al., 2005). Dimensionality reduction algorithms such as independent component analysis (Smith et al., 2009) extracted latent components that explain the largest variance in fMRI time series. Other lines of work used time-resolved analyses to capture latent state dynamics. For example, variants of clustering algorithms, such as co-activation patterns (Liu et al., 2018; Liu and Duyn, 2013), k-means clustering (Allen et al., 2014), and HMM (Baker et al., 2014; Chen et al., 2016; Vidaurre et al., 2018, 2017), characterized fMRI time series as recurrences of and transitions between a small number of states. Time-lag analysis was used to identify quasiperiodic spatiotemporal patterns of propagating brain activity (Abbas et al., 2019; Yousefi and Keilholz, 2021). A recent study extensively compared these different algorithms and showed that they all report qualitatively similar latent states or components when applied to fMRI data (Bolt et al., 2022). While these studies used different algorithms to probe data-specific brain states, this work and ours report common latent axes that follow a long-standing theory of large-scale human functional systems (Mesulam, 1998). Neural dynamics span principal axes that dissociate unimodal to transmodal and sensory to motor information processing systems.”

      Reviewer #2 (Public Review):

      In this study, Song and colleagues applied a Hidden Markov Model to whole-brain fMRI data from the unique SONG dataset and a grad-CPT task, and in doing so observed robust transitions between low-dimensional states that they then attributed to specific psychological features extracted from the different tasks.

      The methods used appeared to be sound and robust to parameter choices. Whenever choices were made regarding specific parameters, the authors demonstrated that their approach was robust to different values, and also replicated their main findings on a separate dataset.

      I was mildly concerned that similarities in some of the algorithms used may have rendered some of the inter-measure results as somewhat inevitable (a hypothesis that could be tested using appropriate null models).

      This work is quite integrative, linking together a number of previous studies into a framework that allows for interesting follow-up questions.

      Overall, I found the work to be robust, interesting, and integrative, with a wide-ranging citation list and exciting implications for future work.

      We appreciate the reviewer’s comments on the study’s robustness and future implications. Our work was highly motivated by the reviewer’s prior work.

      Reviewer #3 (Public Review):

      My general assessment of the paper is that the analyses done after they find the model are exemplary and show some interesting results. However, the method they use to find the number of states (Calinski-Harabasz score instead of log-likelihood), the model they use generally (HMM), and the fact that they don't show how they find the number of states on HCP, with the Schaefer atlas, and do not report their R^2 on a test set are a little concerning. I don't think this per se impedes their results, but it is something that they can improve. They argue that the states they find align with long-standing ideas about the functional organization of the brain and align with other research, but they can improve their model selection.

      We appreciate the reviewer’s thorough read of the paper, evaluation of our analyses linking brain states to behavior as “exemplary”, and important questions about the modeling approach. We have included detailed responses below and updated the manuscript accordingly.

      Strengths:

      • Uses multiple datasets, multiple ROIs, and multiple analyses to validate their results

      • Figures are convincing in the sense that patterns clearly synchronize between participants

      • Authors select the number of states using the optimal model fit (although this turns out to be a little more questionable due to what they quantify as 'optimal model fit')

      We address this concern on pages 30-31 of this response letter.

      • Replication with Schaefer atlas makes results more convincing

      • The analyses around the fact that the base state acts as a flexible hub are well done and well explained

      • Their comparison of synchrony is well done, and comparing it to resting state, which does not show any significant synchrony among participants, is obvious but still a good baseline.

      • Their results with respect to similar narrative engagement being correlated with similar neural state dynamics are well done and interesting.

      • Their results on event boundaries are compelling and well done. However, I do not find their Chang et al. results convincing (Figure 4B). It could just be that the different medium explains differences in DMN response, but to me, these seem like altogether different patterns that cannot be fully explained by their method/results.

      We entirely agree with the reviewer that the Chang et al. (2021) data are different in many ways from our own SONG dataset. Whereas data from Chang et al. (2021) were collected while participants listened to an audio-only narrative, participants in the SONG sample watched and listened to audiovisual stimuli. They were scanned at different universities in different countries with different protocols by different research groups for different purposes. That is, there are numerous reasons why we would expect the model not to generalize. Thus, we found it compelling and surprising that, despite all of these differences between the datasets, the model trained on the SONG dataset generalized to the data from Chang et al. (2021). The results highlighted a robust increase in the DMN state occurrence and a decrease in the base state occurrence after the narrative event boundaries, irrespective of whether the stimulus was an audiovisual sitcom episode or a narrated story. This external model validation was a way that we tested the robustness of our own model and the relationship between neural state dynamics and cognitive dynamics.

      • Their result that, when there is no event, 50% of transitions into the DMN state come from the base state is interesting and strong. However, it is unclear whether this holds only for the sitcom or also for Chang et al.'s data.

      We apologize for the lack of clarity. We show the statistical results of the two sitcom episodes as well as Chang et al.’s (2021) data in Figure 4—figure supplement 2 in our original manuscript. Here, we provide the exact values of the base-to-DMN state transition probability, and how they differ across moments after event boundaries compared to non-event boundaries.

      For sitcom episode 1, the probability of a base-to-DMN state transition was 44.6 ± 18.8% at event boundaries and 62.0 ± 10.4% at non-event boundaries (FDR-p = 0.0013). For sitcom episode 2, it was 44.1 ± 18.0% at event boundaries and 62.2 ± 7.6% at non-event boundaries (FDR-p = 0.0006). For the Chang et al. (2021) dataset, it was 33.3 ± 15.9% at event boundaries and 58.1 ± 6.4% at non-event boundaries (FDR-p < 0.0001). Thus, our result, “At non-event boundaries, the DMN state was most likely to transition from the base state, accounting for more than 50% of the transitions to the DMN state” (page 11, lines 24-25), holds for both the internal and external datasets.
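
      To make the quantity above concrete, the following Python sketch computes the share of transitions into the DMN state that come from the base state within a chosen set of time points (e.g., TRs following event boundaries). It is an illustration under assumptions, not our analysis code: the integer state coding (DMN = 0, base = 3), the function name, and the exclusion of DMN-to-DMN self-transitions are hypothetical choices.

```python
import numpy as np

def base_to_dmn_share(states, mask, base=3, dmn=0):
    """Share of transitions into `dmn` that originate from `base`.

    states : 1-D integer array of per-TR state labels (hypothetical coding).
    mask   : boolean array, True at TRs to include (e.g., TRs after event
             boundaries); the TR at which the DMN state is entered must be in it.
    Self-transitions (dmn -> dmn) are excluded.
    """
    states = np.asarray(states)
    mask = np.asarray(mask, dtype=bool)
    prev, curr = states[:-1], states[1:]
    # Transitions that enter the DMN state from a different state at an included TR
    entering = (curr == dmn) & (prev != dmn) & mask[1:]
    if not entering.any():
        return np.nan
    return float(np.mean(prev[entering] == base))
```

      Splitting TRs into post-boundary and non-boundary masks and calling the function once per mask gives the two probabilities compared above.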

      • The involvement of the base state as being highly engaged during the comedy sitcom and the movie are interesting results that warrant further study into the base state theory they pose in this work.

      • It is good that they make sure SM states are not just because of head motion (P 12).

      • Their comparison between functional gradient and neural states is good, and their results are generally well-supported, intuitive, and interesting enough to warrant further research into them. Their findings on the context-specificity of their DMN and DAN state are interesting and relate well to the antagonistic relationship in resting-state data.

      Weaknesses:

      • Authors should train the model on part of the data and validate on another

      Thank you for raising this issue. To the best of our knowledge, past work that applied the HMM to the fMRI data has conducted training and inference on the same data, including initial work that implemented HMM on the resting-state fMRI (Baker et al., 2014; Chen et al., 2016; Vidaurre et al., 2018, 2017) as well as more recent work that applied HMMs to the task or movie-watching fMRI (Cornblath et al., 2020; Taghia et al., 2018; van der Meer et al., 2020; Yamashita et al., 2021). That is, the parameters—emission probability, transition probability, and initial probability—were estimated from the entire dataset and the latent state sequence was inferred using the Viterbi algorithm on the same dataset.
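
      For readers less familiar with the inference step, here is a minimal textbook Viterbi sketch in Python/NumPy for an HMM with multivariate Gaussian emissions. It assumes full-covariance Gaussians and known initial, transition, and emission parameters, and the function names are hypothetical; it is not the implementation used in the manuscript.

```python
import numpy as np

def gaussian_logpdf(X, mean, cov):
    """Log density of each row of X (T x d) under a multivariate normal."""
    d = X.shape[1]
    diff = X - mean
    _, logdet = np.linalg.slogdet(cov)
    # Solve cov @ z = diff.T instead of inverting cov explicitly
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def viterbi(X, start_prob, trans_prob, means, covs):
    """Most likely state sequence for observations X (T x d)."""
    T, K = X.shape[0], len(start_prob)
    log_emit = np.column_stack(
        [gaussian_logpdf(X, means[k], covs[k]) for k in range(K)])
    log_A = np.log(trans_prob)
    delta = np.log(start_prob) + log_emit[0]   # best log-prob of paths ending in each state
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A        # scores[i, j]: move from state i to state j
        backptr[t] = np.argmax(scores, axis=0)
        delta = scores[backptr[t], np.arange(K)] + log_emit[t]
    # Trace the best path backwards from the most likely final state
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 1, 0, -1):
        path[t - 1] = backptr[t, path[t]]
    return path
```

      Working in log probabilities keeps long fMRI runs from underflowing; the sketch assumes all start and transition probabilities are strictly positive.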

      However, we were also aware of the potential problem this may have. Therefore, in our recent work asking a different research question in another fMRI dataset (Song et al., 2021b), we trained an HMM on a subset of the dataset (moments when participants were watching movie clips in the original temporal order) and inferred the latent state sequence of the fMRI time series in another subset of the dataset (moments when participants were watching movie clips in a scrambled temporal order). To the best of our knowledge, this was the first paper that used different segments of the data to fit and infer states from the HMM.

      In the current study, we wanted to capture brain states that underlie brain activity across contexts. Thus, we presented the same-dataset training and inference procedure as our primary result. However, for every main result, we also showed results where we separated the data used for model fitting and state inference. That is, we fit the HMM on the SONG dataset, primarily report the inference results on the SONG dataset, but also report inference on the external datasets that were not included in model fitting. The datasets used were the Human Connectome Project dataset (Van Essen et al., 2013), Chang et al. (2021) audio-listening dataset, Rosenberg et al. (2016) gradCPT dataset, and Chen et al. (2017) Sherlock dataset.

      However, to further address the reviewer’s concern about whether the HMM fit is reliable when applied to held-out data, we assessed the reliability of HMM inference with cross-validation and split-half reliability analyses.

      (1) Cross-validation

      To separate the dataset used for HMM training and inference, we conducted cross-validation on the SONG dataset (N=27) by training the model with the data from 26 participants and inferring the latent state sequence of the held-out participant.

      First, we assessed the robustness of model training by comparing the mean activity patterns of the four latent states fitted at the group level (N=27) with those fitted across cross-validation folds. Pearson’s correlations between the group-level and cross-validated latent states’ mean activity patterns were r = 0.991 ± 0.010 (range: 0.963 to 0.999).
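
      Because HMM state labels are arbitrary across fits, comparing mean activity patterns across folds requires first matching states. A minimal sketch, assuming states are matched by the permutation that maximizes the summed Pearson correlation (brute force is feasible for K = 4); the function name is hypothetical, and this is not necessarily how the matching was done in our pipeline:

```python
import itertools
import numpy as np

def match_states(ref_means, new_means):
    """Relabel states of a new fit to best match a reference fit.

    ref_means, new_means : (K x n_rois) arrays of state mean activity.
    Returns (perm, rs): new_means[perm[k]] is matched to ref_means[k], and
    rs[k] is the Pearson correlation of that matched pair.
    Brute force over K! permutations; fine for K = 4 (24 permutations).
    """
    K = ref_means.shape[0]
    # corr[i, j]: correlation between reference state i and new state j
    corr = np.corrcoef(ref_means, new_means)[:K, K:]
    best = max(itertools.permutations(range(K)),
               key=lambda p: sum(corr[k, p[k]] for k in range(K)))
    perm = np.array(best)
    return perm, corr[np.arange(K), perm]
```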

      Second, we assessed the robustness of model inference by comparing the latent state sequences inferred at the group level with those inferred for held-out participants in the cross-validation scheme. All fMRI conditions had a mean similarity above 90%: Rest 1: 92.74 ± 5.02%, Rest 2: 92.74 ± 4.83%, GradCPT face: 92.97 ± 6.41%, GradCPT scene: 93.27 ± 5.76%, Sitcom ep1: 93.31 ± 3.92%, Sitcom ep2: 93.13 ± 4.36%, Documentary: 92.42 ± 4.72%.
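
      The similarity values above can be read as the percentage of time points at which two label sequences agree. A minimal sketch, assuming states have already been matched across fits (the function name is hypothetical):

```python
import numpy as np

def sequence_similarity(seq_a, seq_b):
    """Percentage of time points at which two state sequences agree.

    Assumes both sequences use the same (already matched) state labels.
    """
    seq_a, seq_b = np.asarray(seq_a), np.asarray(seq_b)
    if seq_a.shape != seq_b.shape:
        raise ValueError("sequences must have the same length")
    return 100.0 * float(np.mean(seq_a == seq_b))
```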

      Third, using the latent state sequences inferred with cross-validation, we replicated the analysis of Figure 3 to test for synchrony of the latent state sequences across participants. The cross-validated results were highly similar to manuscript Figure 3, which was generated from the group-level analysis. Mean synchrony of latent state sequences was as follows: Rest 1: 25.90 ± 3.81%, Rest 2: 25.75 ± 4.19%, GradCPT face: 27.17 ± 3.86%, GradCPT scene: 28.11 ± 3.89%, Sitcom ep1: 40.69 ± 3.86%, Sitcom ep2: 40.53 ± 3.13%, Documentary: 30.13 ± 3.41%.
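
      Synchrony can be sketched in a similar way, assuming it is the mean, over all participant pairs, of the percentage of TRs at which both participants occupy the same state (an assumed definition; the manuscript Methods give the exact one). Under that definition, if two participants had the same state occupancies and independent sequences, chance agreement would be the sum of squared occupancies, which is at least 25% for four states; this is a useful reference point for the rest values near 26%. The function name is hypothetical.

```python
import itertools
import numpy as np

def state_synchrony(seqs):
    """Mean pairwise agreement (%) of state sequences across participants.

    seqs : (n_participants x n_TRs) integer array of matched state labels.
    Returns the mean over participant pairs of the percentage of TRs at which
    both participants occupy the same state.
    """
    seqs = np.asarray(seqs)
    pair_scores = [np.mean(seqs[i] == seqs[j])
                   for i, j in itertools.combinations(range(len(seqs)), 2)]
    return 100.0 * float(np.mean(pair_scores))
```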

      Author response image 2.

      (2) Split-half reliability

      To test the internal robustness of the model, we randomly assigned SONG dataset participants to two groups and fit the HMM separately in each. Similarity (Pearson’s correlation) between the two groups’ activation patterns was DMN: 0.791, DAN: 0.838, SM: 0.944, base: 0.837. Similarity of the covariance patterns was DMN: 0.995, DAN: 0.996, SM: 0.994, base: 0.996.
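
      For the covariance comparison, one common approach is to correlate the unique off-diagonal entries of the two ROI-by-ROI covariance matrices. A sketch under that assumption (whether the diagonal is included is a free choice here, and the function name is hypothetical):

```python
import numpy as np

def covariance_similarity(cov_a, cov_b, include_diagonal=False):
    """Pearson correlation between two ROI x ROI covariance matrices.

    Only the upper triangle is used so that each ROI pair counts once;
    whether the diagonal (variances) is included is an assumption.
    """
    k = 0 if include_diagonal else 1
    iu = np.triu_indices_from(cov_a, k=k)
    return float(np.corrcoef(cov_a[iu], cov_b[iu])[0, 1])
```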

      Author response image 3.

      We further validated the split-half reliability of the model using the HCP dataset, which contains data from a larger sample (N=119). Similarity (Pearson’s correlation) between the two groups’ activation patterns was DMN: 0.998, DAN: 0.997, SM: 0.993, base: 0.923. Similarity of the covariance patterns was DMN: 0.995, DAN: 0.996, SM: 0.994, base: 0.996.

      Together, the cross-validation and split-half reliability results demonstrate that the HMM results reported in the manuscript are reliable and robust to the way we conducted the analysis. The result of the split-half reliability analysis has been added to the Results.

      [Manuscript, page 3-4] “Neural state inference was robust to the choice of 𝐾 (Figure 1—figure supplement 1) and the fMRI preprocessing pipeline (Figure 1—figure supplement 5) and consistent when conducted on two groups of randomly split-half participants (Pearson’s correlations between the two groups’ latent state activation patterns: DMN: 0.791, DAN: 0.838, SM: 0.944, base: 0.837).”

      • Comparison with just PCA/functional gradients is weak in establishing whether HMMs are good models of the timeseries. Especially given that the HMM does not explain a lot of variance in the signal (~0.5 R^2 for only 27 brain regions) for PCA. I think they don't report their own R^2 of the timeseries

      We agree with the reviewer that the PCA that we conducted to compare with the explained variance of the functional gradients was not directly comparable because PCA and gradients utilize different algorithms to reduce dimensionality. To make more meaningful comparisons, we removed the data-specific PCA results and replaced them with data-specific functional gradients (derived from the SONG dataset). This allows us to directly compare SONG-specific functional gradients with predefined gradients (derived from the resting-state HCP dataset from Margulies et al. [2016]). We found that the degrees to which the first two predefined gradients explained whole-brain fMRI time series (SONG: r² = 0.097, HCP: 0.084) were comparable to the amount of variance explained by the first two data-specific gradients (SONG: r² = 0.100, HCP: 0.086). Thus, the predefined gradients explain as much variance in the SONG data time series as SONG-specific gradients do. This supports our argument that the low-dimensional manifold is largely shared across contexts, and that the common HMM latent states may tile the predefined gradients.

      These analyses and results were added to the Results, Methods, and Figure 1—figure supplement 8. Here, we only attach changes to the Results section for simplicity, but please see the revised manuscript for further changes.

      [Manuscript, page 5-6] “We hypothesized that the spatial gradients reported by Margulies et al. (2016) act as a low-dimensional manifold over which large-scale dynamics operate (Bolt et al., 2022; Brown et al., 2021; Karapanagiotidis et al., 2020; Turnbull et al., 2020), such that traversals within this manifold explain large variance in neural dynamics and, consequently, cognition and behavior (Figure 1C). To test this idea, we situated the mean activity values of the four latent states along the gradients defined by Margulies et al. (2016) (see Methods). The brain states tiled the two-dimensional gradient space with the base state at the center (Figure 1D; Figure 1—figure supplement 7). The Euclidean distances between these four states in the two-dimensional gradient space were larger than chance, as estimated from states inferred from circular-shifted time series (p < 0.001). For the SONG dataset, the DMN and SM states fell at more extreme positions on the primary gradient than expected by chance (both FDR-p values = 0.004; DAN and base states, FDR-p values = 0.171). For the HCP dataset, the DMN and DAN states fell at more extreme positions on the primary gradient (both FDR-p values = 0.004; SM and base states, FDR-p values = 0.076). No state was consistently found at the extremes of the secondary gradient (all FDR-p values > 0.021).

      We asked whether the predefined gradients explain as much variance in neural dynamics as a latent subspace optimized for the SONG dataset. To do so, we applied the same nonlinear dimensionality reduction algorithm to the SONG dataset’s ROI time series. Of note, the SONG dataset includes 18.95% rest, 15.07% task, and 65.98% movie-watching data, whereas the data used by Margulies et al. (2016) were 100% rest. Despite these differences, the SONG-specific gradients closely resembled the predefined gradients, with significant Pearson’s correlations observed for the first (r = 0.876) and second (r = 0.877) gradient embeddings (Figure 1—figure supplement 8). Gradients identified with the HCP data also recapitulated Margulies et al.’s (2016) first (r = 0.880) and second (r = 0.871) gradients. We restricted our analysis to the first two gradients because together they explained roughly 50% of the total variance of the functional brain connectome (SONG: 46.94%, HCP: 52.08%), and the explained variance dropped sharply from the third gradient onward (by more than one third relative to the second gradient). The degrees to which the first two predefined gradients explained whole-brain fMRI time series (SONG: r² = 0.097, HCP: 0.084) were comparable to the amount of variance explained by the first two data-specific gradients (SONG: r² = 0.100, HCP: 0.086; Figure 1—figure supplement 8). Thus, the low-dimensional manifold captured by the Margulies et al. (2016) gradients is highly replicable, explains brain activity dynamics as well as data-specific gradients do, and is largely shared across contexts and datasets. This suggests that the state space of whole-brain dynamics closely recapitulates low-dimensional gradients of the static functional brain connectome.”
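
      One plausible way to compute r² values like those quoted above is to regress each time point's spatial activity pattern onto the fixed gradient maps and pool the residual variance across time points. The sketch below makes that assumption explicit; the intercept term, the pooling, and the function name are hypothetical choices and may differ from the procedure described in the Methods.

```python
import numpy as np

def gradient_r2(Y, G):
    """Variance in ROI time series explained by fixed spatial gradient maps.

    Y : (n_TRs x n_rois) z-scored ROI time series.
    G : (n_rois x n_gradients) gradient loadings (e.g., the first two).
    Each TR's spatial pattern is regressed on the gradient maps plus an
    intercept; r^2 pools residual and total sums of squares over all TRs.
    """
    X = np.column_stack([np.ones(G.shape[0]), G])        # n_rois x (1 + n_gradients)
    beta, *_ = np.linalg.lstsq(X, Y.T, rcond=None)       # one least-squares fit per TR
    resid = Y.T - X @ beta
    centered = Y.T - Y.T.mean(axis=0, keepdims=True)     # center each TR's spatial pattern
    return 1.0 - float(np.sum(resid ** 2) / np.sum(centered ** 2))
```

      Holding G fixed to the predefined maps versus data-specific maps, and comparing the two r² values, mirrors the comparison reported above.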

      The reviewer also pointed out that the PCA-gradient comparison was weak in establishing whether HMMs are good models of the time series. However, we would like to point out that the purpose of the comparison was not to validate the performance of the HMM. Instead, we wanted to test whether the gradients introduced by Margulies et al. (2016) could act as a generalizable low-dimensional manifold of brain state dynamics. To argue that the predefined gradients are a shared manifold, these gradients should explain SONG data fMRI time series as much as the principal components derived directly from the SONG data. Our results showed comparable r², both in predefined gradient vs. data-specific PC comparisons and predefined gradient vs. data-specific gradient comparisons, which supported our argument that the predefined gradients could be the shared embedding space across contexts and datasets.

      The reviewer pointed out that an r² of ~0.5 does not explain enough variance in the fMRI signal. We respectfully disagree, because there is no established criterion for what constitutes a high or low r² for this type of analysis. Of note, previous studies that also applied PCA to fMRI time series (Lynn et al., 2021; Shine et al., 2019; Author response image 4A and 4B) found that the cumulative explained variance of the top five principal components is around 50%. Author response image 4C shows the cumulative variance of the resting-state functional connectome explained by the gradients (Margulies et al., 2016).

      Author response image 4.

      Finally, the reviewer pointed out that the r² of the HMM-derived latent state sequence with respect to the fMRI time series should be reported. However, there is no standardized way of measuring the explained variance of HMM inference, and the traditional HMM-fMRI papers do not report explained variance (Baker et al., 2014; Chen et al., 2016; Vidaurre et al., 2018, 2017). Rather than r², the HMM computes the log likelihood of the model fit. However, because log likelihood values depend on the number of data points, studies typically neither report them nor use them to interpret the goodness of model fit.
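
      The dependence of log likelihood on data length follows from how it is computed. The sketch below is a standard forward algorithm in log space (NumPy only; the function name is hypothetical, and the per-state emission log densities are taken as given). Because log p(X) accumulates one emission term per time point, its raw value scales with the number of time points, so raw values are not comparable across datasets of different lengths.

```python
import numpy as np

def log_likelihood(log_emit, start_prob, trans_prob):
    """HMM log-likelihood log p(X) via the forward algorithm in log space.

    log_emit : (T x K) array of log p(x_t | state k), e.g., from Gaussian emissions.
    """
    log_A = np.log(trans_prob)

    def logsumexp(a, axis):
        # Numerically stable log(sum(exp(a))) along one axis
        m = np.max(a, axis=axis, keepdims=True)
        return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

    alpha = np.log(start_prob) + log_emit[0]          # log p(x_1, z_1 = k)
    for t in range(1, log_emit.shape[0]):
        # Sum over previous states i of alpha_i * A[i, j], then add emission for j
        alpha = logsumexp(alpha[:, None] + log_A, axis=0) + log_emit[t]
    return float(logsumexp(alpha, axis=0))
```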

      To ask whether the HMM fit was significantly better than chance, we compared the log likelihood of the HMM to a null distribution of log likelihoods. First, we extracted the log likelihood of the HMM fit to the real fMRI time series. We then fit null HMMs to circular-shifted fMRI time series, repeating this 1,000 times. The log likelihood of the real model was significantly higher than the null distribution (z = 2182.5, p < 0.001). This indicates that the HMM fit our fMRI time series significantly better than chance.
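
      The null procedure can be sketched as follows. This assumes that each ROI's time series is circularly shifted by an independent random offset (preserving each ROI's autocorrelation while disrupting cross-ROI alignment) and that the z-value is computed against the mean and standard deviation of the null log likelihoods. Refitting the HMM to each shifted copy is omitted, and the function names are hypothetical.

```python
import numpy as np

def circular_shift(ts, rng):
    """Independently roll each ROI's time series (columns) by a random offset.

    Preserves each ROI's autocorrelation and variance while disrupting the
    temporal alignment between ROIs (one common null; assumed here).
    """
    T = ts.shape[0]
    shifts = rng.integers(1, T, size=ts.shape[1])    # offsets in 1..T-1, never zero
    return np.column_stack([np.roll(ts[:, i], s) for i, s in enumerate(shifts)])

def null_z(observed, null_values):
    """z-score of an observed statistic relative to a null distribution."""
    null_values = np.asarray(null_values, dtype=float)
    return (observed - null_values.mean()) / null_values.std(ddof=1)
```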

      • Authors do not specify whether they also did cross-validation for the HCP dataset to find 4 clusters

      We apologize for the lack of clarity. When we computed the Calinski-Harabasz score with the HCP dataset, three was chosen as the optimal number of states (Author response image 5A). When we set K to 3, the HMM inferred the DMN, DAN, and SM states (Author response image 5C). The base state was included when K was set to 4 (Author response image 5B). The activation pattern similarities of the DMN, DAN, and SM states were r = 0.981, 0.984, and 0.911, respectively.
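
      For reference, the Calinski-Harabasz score compares between-state to within-state dispersion, scaled by the number of clusters and samples. A NumPy sketch of the standard formula follows (scikit-learn provides an equivalent `sklearn.metrics.calinski_harabasz_score`). Applying it to time points labeled by their inferred HMM state, and choosing the K with the highest score, is an assumption about the workflow rather than a description of our code; the function name is hypothetical.

```python
import numpy as np

def calinski_harabasz(X, labels):
    """Calinski-Harabasz score: between- vs within-cluster dispersion.

    X : (n_samples x n_features); labels : cluster (state) label per sample.
    Higher values indicate more compact, better-separated clusters.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n, ks = X.shape[0], np.unique(labels)
    k = len(ks)
    overall = X.mean(axis=0)
    between = within = 0.0
    for c in ks:
        Xc = X[labels == c]
        centroid = Xc.mean(axis=0)
        between += len(Xc) * np.sum((centroid - overall) ** 2)   # weighted centroid spread
        within += np.sum((Xc - centroid) ** 2)                   # spread inside the cluster
    return (between / (k - 1)) / (within / (n - k))
```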

      Author response image 5.

      We did not use K = 3 for the HCP data replication because we were not trying to test whether these four states would be the optimal set of states in every dataset. Although the Calinski-Harabasz score chose K = 3 because it showed the best clustering performance, this does not mean that the base state is not meaningful to this dataset. Likewise, the latent states that are inferred when we increase/decrease the number of states are also meaningful states. For example, in Figure 1—figure supplement 1, we show an example of the SONG dataset’s latent states when we set K to 7. The seven latent states included the DAN, SM, and base states, the DMN state was subdivided into DMN-A and DMN-B states, and the FPN state and DMN+VIS state were included. Setting a higher number of states like K = 7 would mean that we are capturing brain state dynamics in a higher dimension than when using K = 4. Because we are utilizing a higher number of states, a model set to K = 7 would inevitably capture a larger variance of fMRI time series than a model set to K = 4.

      The purpose of latent state replication with the HCP dataset was to validate the generalizability of the DMN, DAN, SM, and base states. Before characterizing these latent states’ relevance to cognition, we needed to verify that these latent states were not simply overfit to the SONG dataset. The fact that the HMM revealed a similar set of latent states when applied to the HCP dataset suggested that the states were not merely specific to SONG data.

      To make our points clearer in the manuscript, we emphasized that we are not arguing for the four states to be the exclusive states. We made edits to Discussion as follows.

      [Manuscript, page 16] “Our study adopted the assumption of low dimensionality of large-scale neural systems, which led us to intentionally identify only a small number of states underlying whole-brain dynamics. Importantly, however, we do not claim that the four states will be the optimal set of states in every dataset and participant population. Instead, latent states and patterns of state occurrence may vary as a function of individuals and tasks (Figure 1—figure supplement 2). Likewise, while the lowest dimensions of the manifold (i.e., the first two gradients) were largely shared across datasets tested here, we do not argue that it will always be identical. If individuals and tasks deviate significantly from what was tested here, the manifold may also differ along with changes in latent states (Samara et al., 2023). Brain systems operate at different dimensionalities and spatiotemporal scales (Greene et al., 2023), which may have different consequences for cognition. Asking how brain states and manifolds—probed at different dimensionalities and scales—flexibly reconfigure (or not) with changes in contexts and mental states is an important research question for understanding complex human cognition.”

      • One of their main contributions is the base state but the correlation between the base state in their Song dataset and the HCP dataset is only 0.399

      This is a good point. However, there is precedent for lower spatial pattern correlation of the base state compared to other states in the literature.

      Compared to the DMN, DAN, and SM states, the base state did not show characteristic activation or deactivation of functional networks. Most of the functional networks showed activity levels close to the mean (z = 0). With this flattened activation pattern, relatively low activation pattern similarity was observed between the SONG base state and the HCP base state.
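
      A small simulation illustrates why a flat pattern yields low between-sample similarity even when the state itself is stable: if the true pattern is near z = 0 everywhere, two independent estimates differ mostly by noise, so their spatial correlation is near zero. The parcel count, noise level, and seed are arbitrary, hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_parcels = 25                               # hypothetical parcel count
structured = rng.normal(size=n_parcels)      # a state with strong network-level structure
flat = np.zeros(n_parcels)                   # a "base-like" state with activity near z = 0

def replicate(pattern, noise_sd=0.3):
    """Pattern as estimated in an independent sample: true pattern plus noise."""
    return pattern + rng.normal(scale=noise_sd, size=pattern.shape)

# Spatial similarity between two independent estimates of the same state
r_structured = np.corrcoef(replicate(structured), replicate(structured))[0, 1]
r_flat = np.corrcoef(replicate(flat), replicate(flat))[0, 1]
```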

      In Figure 1—figure supplement 6, we write, “The DMN, DAN, and SM states showed similar mean activity patterns. We refrained from making interpretations about the base state’s activity patterns because the mean activity of most of the parcels was close to z = 0”.

      A similar finding has been reported in a previous work by Chen et al. (2016) that discovered the base state with HMM. State 9 (S9) of their results is comparable to our base state. They report that even though the spatial correlation coefficient of the brain state from the split-half reliability analysis was the lowest for S9 due to its low degrees of activation or deactivation, S9 was stably inferred by the HMM. The following is a direct quote from their paper:

      “To the best of our knowledge, a state similar to S9 has not been presented in previous literature. We hypothesize that S9 is the “ground” state of the brain, in which brain activity (or deactivity) is similar for the entire cortex (no apparent activation or deactivation as shown in Fig. 4). Note that different groups of subjects have different spatial patterns for state S9 (Fig. 3A). Therefore, S9 has the lowest reproducible spatial pattern (Fig. 3B). However, its temporal characteristics allowed us to distinguish it consistently from other states.” (Chen et al., 2016)

      Thus, we believe our data and prior results support the existence of the “base state”.

      • Figure 1B: Parcellation is quite big but there seems to be a gradient within regions

      This is a function of the visualization software. Mean activity (z) is the same for all voxels within a parcel. To visualize the 3D contours of the brain, we chose an option in the nilearn Python function that smooths the mean activity values based on the surface-reconstructed anatomy.
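
      To illustrate why the underlying data contain no within-parcel gradient, the sketch below maps one value per parcel onto a labeled volume, so every voxel in a parcel receives the same value. Any apparent gradient arises later, when the volume is projected onto and rendered on a surface mesh (for example, nilearn's volume-to-surface sampling can mix values near parcel borders). The function name and label convention (0 = background, 1..P = parcels) are hypothetical.

```python
import numpy as np

def parcel_values_to_voxels(atlas_labels, parcel_values):
    """Broadcast one value per parcel to every voxel carrying that parcel label.

    atlas_labels  : integer array (any shape) with 0 = background, 1..P = parcels.
    parcel_values : length-P array, e.g., a state's mean z per parcel.
    """
    # Index 0 maps background voxels to 0; index p maps to parcel p's value
    lookup = np.concatenate([[0.0], np.asarray(parcel_values, dtype=float)])
    return lookup[atlas_labels]
```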

      In the original manuscript, our Methods write, “The brain surfaces were visualized with nilearn.plotting.plot_surf_stat_map. The parcel boundaries in Figure 1B are smoothed from the volume-to-surface reconstruction.”

      • Figure 1D: Why are the DMNs further apart between SONG and HCP than the other states

      To address this question, we first tested whether the position of the DMN states in the gradient space is significantly different for the SONG and HCP datasets. We generated surrogate HMM states from the circular-shifted fMRI time series and positioned the four latent states and the null DMN states in the 2-dimensional gradient space (Author response image 6).

      Author response image 6.

      We next tested whether the Euclidean distance between the SONG dataset’s DMN state and the HCP dataset’s DMN state is larger than would be expected by chance (Author response image 7). To do so, we took the difference between the DMN state positions and compared it to the 1,000 differences generated from the surrogate latent states. The DMN states of the SONG and HCP datasets did not significantly differ in the Gradient 1 dimension (two-tailed test, p = 0.794). However, as the reviewer noted, the positions differed significantly in the Gradient 2 dimension (p = 0.047). The DMN state leaned more towards the Visual gradient in the SONG dataset, whereas it leaned more towards the Somatosensory-Motor gradient in the HCP dataset.
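
      The two-tailed test above can be sketched as an empirical p-value: the proportion of surrogate differences whose absolute value is at least as large as the observed one. The +1 correction and the function name are assumptions, not necessarily what was used in our analysis.

```python
import numpy as np

def two_tailed_p(observed_diff, null_diffs):
    """Two-tailed empirical p-value of an observed difference vs a null distribution.

    Counts null differences at least as extreme (in absolute value) as the
    observed one; the +1 terms keep p > 0 with a finite number of surrogates.
    """
    null_diffs = np.asarray(null_diffs, dtype=float)
    extreme = np.sum(np.abs(null_diffs) >= abs(observed_diff))
    return (extreme + 1) / (len(null_diffs) + 1)
```

      Applying it separately to the Gradient 1 and Gradient 2 coordinates of the DMN-state difference corresponds to the two per-dimension tests reported above.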

      Author response image 7.

      Though we cannot determine the exact reason for this across-dataset difference, we note a distinctive difference between the SONG and HCP datasets. Both datasets included rest, controlled tasks, and movie watching. The SONG dataset comprised 18.95% rest, 15.07% task, and 65.98% movie watching, and its only task was the gradCPT, i.e., a sustained attention task. The HCP dataset, on the other hand, comprised 52.71% rest, 24.35% task, and 22.94% movie watching, and included 7 different tasks. It is possible that the different proportions of rest, task, and movie watching, and the different cognitive demands of each dataset, may have produced data-specific latent states.

      • Page 5 paragraph starting at L25: Their hypothesis that functional gradients explain large variance in neural dynamics needs more explanation; it is non-trivial, especially because their R^2 scores for PCA are so low (Fig 1. Supplement 8).

      We address this concern on pages 21-23 of this response letter.

      • Generally, I do not find the PCA analysis convincing and believe they should also compare to something like ICA or a different model of dynamics. They do not explain their reasoning behind assuming an HMM, which is an extremely simplified idea of brain dynamics meaning they only change based on the previous state.

      We appreciate this perspective. We replaced the comparison between the Margulies et al. (2016) gradients and SONG-specific PCA with a more direct comparison between the Margulies et al. (2016) gradients and SONG-specific gradients, as described on pages 21-23 of this response letter.

      More broadly, we elected to use HMM because of recent work showing correspondence between low-dimensional HMM states and behavior (Cornblath et al., 2020; Taghia et al., 2018; van der Meer et al., 2020; Yamashita et al., 2021). We also found the model’s assumption—a mixture Gaussian emission probability and first-order Markovian transition probability—to be the most suited to analyzing the fMRI time series data. We do not intend to claim that other data-reduction techniques would not also capture low-dimensional, behaviorally relevant changes in brain activity. Instead, our primary focus was identifying a set of latent states that generalize (i.e., recur) across multiple contexts and understanding how those states reflect cognitive and attentional states.

      Although a comparison of possible data-reduction algorithms is out of the scope of the current work, an exhaustive comparison of different models can be found in Bolt et al. (2022). The authors compared dozens of latent brain state algorithms spanning zero-lag analysis (e.g., principal component analysis, principal component analysis with Varimax rotation, Laplacian eigenmaps, spatial independent component analysis, temporal independent component analysis, hidden Markov model, seed-based correlation analysis, and co-activation patterns) to time-lag analysis (e.g., quasi-periodic pattern and lag projections). Bolt et al. (2022) write “a range of empirical phenomena, including functional connectivity gradients, the task-positive/task-negative anticorrelation pattern, the global signal, time-lag propagation patterns, the quasiperiodic pattern and the functional connectome network structure, are manifestations of the three spatiotemporal patterns.” That is, many previous findings that used different methods essentially describe the same recurring latent states. A similar argument was made in previous papers (Brown et al., 2021; Karapanagiotidis et al., 2020; Turnbull et al., 2020).

      We agree that the HMM is a simplified idea of brain dynamics. We do not argue that four states can fully explain the complexity and flexibility of cognition. Instead, we hoped to show that brain systems can operate at different dimensionalities, which may have different consequences for cognition. We “simplified” neural dynamics to a discrete sequence of a small number of states. What is fascinating, however, is that these overly “simplified” brain state dynamics can explain certain cognitive and attentional dynamics, such as event segmentation and sustained attention fluctuations. We highlight this point in the Discussion.

      [Manuscript, page 16] “Our study adopted the assumption of low dimensionality of large-scale neural systems, which led us to intentionally identify only a small number of states underlying whole-brain dynamics. Importantly, however, we do not claim that the four states will be the optimal set of states in every dataset and participant population. Instead, latent states and patterns of state occurrence may vary as a function of individuals and tasks (Figure 1—figure supplement 2). Likewise, while the lowest dimensions of the manifold (i.e., the first two gradients) were largely shared across datasets tested here, we do not argue that it will always be identical. If individuals and tasks deviate significantly from what was tested here, the manifold may also differ along with changes in latent states (Samara et al., 2023). Brain systems operate at different dimensionalities and spatiotemporal scales (Greene et al., 2023), which may have different consequences for cognition. Asking how brain states and manifolds—probed at different dimensionalities and scales—flexibly reconfigure (or not) with changes in contexts and mental states is an important research question for understanding complex human cognition.”

      • For the 25-ROI replication, it seems like they again do not try multiple K values for the number of states to validate that 4 states are in fact the correct number.

      In the manuscript, we do not argue that four will be the optimal number of states in every dataset. (In fact, we predict that this may differ depending on the amount of data, participant population, tasks, etc.) Instead, we claim that the four states identified in the SONG dataset are not specific (i.e., overfit) to that sample, but rather recur in independent datasets as well. More broadly, we argue that the complexity and flexibility of human cognition stem from the fact that computation occurs at multiple dimensions, and that the low-dimensional states observed here are robustly related to cognitive and attentional states. To prevent misunderstanding of our results, we emphasized in the Discussion that we are not arguing for a fixed number of states. The paragraph included in our response to the previous comment (page 16 in the manuscript) illustrates this point.

      • Fig 2B: Colorbar goes from -0.05 to 0.05 but values are up to 0.87

      We apologize for the confusion. The current version of the figure is correct. The figure legend states, “The values indicate transition probabilities, such that values in each row sum to 1. The colors indicate differences from the mean of the null distribution where the HMMs were conducted on the circularly shifted time series.”

      We recognize that this complicates the interpretation of the figure. However, after much consideration, we decided that it was valuable to show both the actual transition probabilities (values) and their difference from the mean of null HMMs (colors). The values demonstrate the Markovian property of latent state dynamics, with a high probability of remaining in the same state at consecutive moments and a low probability of transitioning to a different state. The colors indicate that the base state is a transitional hub state by illustrating that the DMN, DAN, and SM states are more likely to transition to the base state than would be expected by chance.
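      As an illustration of how the values and colors in such a figure relate, the sketch below estimates row-normalized transition probabilities empirically from a decoded state sequence and subtracts the element-wise mean of a set of null transition matrices. This is an assumption-laden simplification: the manuscript's null matrices come from HMMs refit on circularly shifted time series, which the sketch takes as a given (hypothetical) input, and an empirical estimate from a decoded sequence may differ from the HMM's fitted transition parameters. Function names are hypothetical.

```python
def transition_matrix(states, n_states):
    """Row-normalized counts of t -> t+1 transitions in a decoded state sequence."""
    counts = [[0] * n_states for _ in range(n_states)]
    for current, nxt in zip(states[:-1], states[1:]):
        counts[current][nxt] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A state with no outgoing transitions gets an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def difference_from_null(observed, null_matrices):
    """Observed transition probabilities minus the element-wise mean of null matrices."""
    n = len(null_matrices)
    size = len(observed)
    return [
        [observed[i][j] - sum(m[i][j] for m in null_matrices) / n for j in range(size)]
        for i in range(size)
    ]


states = [0, 0, 0, 1, 1, 0]            # hypothetical decoded sequence
obs = transition_matrix(states, 2)      # rows: [2/3, 1/3] and [1/2, 1/2]
null = [[[0.5, 0.5], [0.5, 0.5]]] * 2   # hypothetical null transition matrices
print(difference_from_null(obs, null))
```

      Each row of the observed matrix sums to 1 (the "values"), while the difference matrix (the "colors") can be positive or negative and need not sum to 1.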

      • P 16 L4 near-critical, authors need to be more specific in their terminology here especially since they talk about dynamic systems, where near-criticality has a specific definition. It is unclear which definition they are looking for here.

      We agree that our explanation was vague. Because we do not have evidence for this speculative proposal, we removed the mention of near-criticality. Instead, we focus on our observation that the base state serves as a transitional hub within a metastable system.

      [Manuscript, page 17-18] “However, the functional relevance of the base state to human cognition had not been explored previously. We propose that the base state, a transitional hub (Figure 2B) positioned at the center of the gradient subspace (Figure 1D), functions as a state of natural equilibrium. Transitioning to the DMN, DAN, or SM states reflects incursion away from natural equilibrium (Deco et al., 2017; Gu et al., 2015), as the brain enters a functionally modular state. Notably, the base state indicated high attentional engagement (Figure 5E and F) and exhibited the highest occurrence proportion (Figure 3B) as well as the longest dwell times (Figure 3—figure supplement 1) during naturalistic movie watching, whereas its functional involvement was comparatively minor during controlled tasks. This significant relevance to behavior verifies that the base state cannot simply be a byproduct of the model. We speculate that susceptibility to both external and internal information is maximized in the base state—allowing for roughly equal weighting of both sides so that they can be integrated to form a coherent representation of the world—at the expense of the stability of a certain functional network (Cocchi et al., 2017; Fagerholm et al., 2015). When processing rich narratives, particularly when a person is fully immersed without having to exert cognitive effort, a less modular state with high degrees of freedom to reach other states may be more likely to be involved. The role of the base state should be further investigated in future studies.”

      • P16 L13-L17 unnecessary

      We prefer to have the last paragraph as a summary of the implications of this paper. However, if the length of this paper becomes a problem as we work towards publication with the editors, we are happy to remove these lines.

      • I think this paper is solid, but my main issue is with using an HMM, never explaining why, not showing inference results on test data, not reporting an R^2 score for it, and not comparing it to other models. Secondly, they use the Calinski-Harabasz score to determine the number of states, but not the log-likelihood of the fit. This clearly creates a bias in the types of states you will find, namely states that are far away from each other, which likely also leads to the functional gradient and PCA results they have, where they specifically talk about how their states are far away from each other in the functional gradient space and correlated to (orthogonal) components. It is completely unclear to me why they used this measure, because it seems to be one of many scores you could use for clustering (with potentially different results), and it is even odd in the presence of a log-likelihood fit to the data and with the model they use (which does not perform clustering).

      (1) Showing inference results on test data

      We address this concern on pages 19-21 of this response letter.

      (2) Not reporting R² score

      We address this concern on pages 21-23 of this response letter.

      (3) Not comparing the HMM model to other models

      We address this concern on pages 27-28 of this response letter.

      (4) The use of the Calinski-Harabasz score to determine the number of states rather than the log-likelihood of the model fit

      To our knowledge, the log-likelihood of the model fit is rarely used to select the number of states in the HMM literature, because it tends to increase monotonically as the number of states increases. Baker et al. (2014) illustrate this problem, writing:

      “In theory, it should be possible to pick the optimal number of states by selecting the model with the greatest (negative) free energy. In practice however, we observe that the free energy increases monotonically up to K = 15 states, suggesting that the Bayes-optimal model may require an even higher number of states.”

      As shown in Author response image 8 (right), the log-likelihood estimated from the SONG dataset likewise increased monotonically with the number of states, consistent with the findings of Baker et al. (2014). Measures that account for the number of parameters, such as AIC or BIC, have the same issue of monotonic increase.
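      For reference, AIC and BIC add a penalty on the number of free parameters to the log-likelihood. The sketch below uses one standard parameter count for a Gaussian HMM (initial-state probabilities, transition probabilities, state means, and state covariances); the function names are hypothetical, and specific libraries may count parameters slightly differently. One plausible reason the penalty may not reverse the trend is that, with many time points, the log-likelihood gained from an extra state can exceed the added p·ln(T) term; this is our reading, not a result from the letter.

```python
import math


def gaussian_hmm_n_params(n_states, n_dims, covariance="full"):
    """Free parameters of a Gaussian HMM with K states and D input dimensions."""
    k, d = n_states, n_dims
    start = k - 1                 # initial probabilities sum to 1
    transitions = k * (k - 1)     # each transition row sums to 1
    means = k * d
    if covariance == "full":
        covs = k * d * (d + 1) // 2   # symmetric covariance per state
    elif covariance == "diag":
        covs = k * d
    else:
        raise ValueError("covariance must be 'full' or 'diag'")
    return start + transitions + means + covs


def aic(log_likelihood, n_params):
    """Akaike information criterion (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion (lower is better)."""
    return n_params * math.log(n_samples) - 2 * log_likelihood


p = gaussian_hmm_n_params(2, 1)  # 1 + 2 + 2 + 2 = 7
print(p, aic(-100.0, p), bic(-100.0, p, n_samples=100))
```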

      Author response image 8.

      Because there is “no straightforward data-driven approach to model order selection” (Baker et al., 2014), past work has used different approaches to decide on the number of states. For example, Vidaurre et al. (2018) repeated the same HMM training and inference procedures five times with the same hyperparameters for each number of states in a range, and selected the number of states that showed the highest consistency across iterations. Gao et al. (2021) evaluated the clustering performance of the model output using the Calinski-Harabasz score and selected the number of states with the highest ratio of between-cluster separation to within-cluster dispersion. Chang et al. (2021) applied an HMM to voxels of the ventromedial prefrontal cortex using a similar clustering criterion, writing: “To determine the number of states for the HMM estimation procedure, we identified the number of states that maximized the average within-state spatial similarity relative to the average between-state similarity”. In our previous paper (Song et al., 2021b), we reported both reliability and clustering performance measures to decide on the number of states.
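      A consistency criterion of the kind described for Vidaurre et al. (2018) needs a way to compare decoded state sequences across runs, because state labels are arbitrary. The minimal sketch below scores two runs by the best fraction of matching time points over all relabelings of the states, then averages over all pairs of runs. This is one simple choice, not necessarily the exact metric used in that paper, and the function names are hypothetical. The brute-force search over K! permutations is fine for a handful of states; for larger K, scipy.optimize.linear_sum_assignment solves the same matching efficiently.

```python
import itertools


def permutation_agreement(seq_a, seq_b, n_states):
    """Best fraction of time points on which two decoded sequences agree,
    maximized over all relabelings of the states in seq_a."""
    best = 0.0
    for perm in itertools.permutations(range(n_states)):
        matches = sum(1 for a, b in zip(seq_a, seq_b) if perm[a] == b)
        best = max(best, matches / len(seq_a))
    return best


def mean_run_consistency(runs, n_states):
    """Average pairwise agreement across repeated HMM runs (hypothetical input)."""
    scores = [
        permutation_agreement(a, b, n_states)
        for a, b in itertools.combinations(runs, 2)
    ]
    return sum(scores) / len(scores)


# Identical segmentations with different label names count as fully consistent.
print(permutation_agreement([0, 0, 1, 1, 2], [2, 2, 0, 0, 1], 3))
```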

      In the current manuscript, the model consistency criterion from Vidaurre et al. (2018) was ineffective because HMM inference was extremely robust (i.e., it always inferred exactly the same sequence) owing to the large number of data points. Thus, we used the Calinski-Harabasz score as our criterion for selecting the number of states.
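      For readers unfamiliar with the criterion, the Calinski-Harabasz score is the between-cluster dispersion divided by (K − 1), over the within-cluster dispersion divided by (N − K), where the clusters here are the time points grouped by decoded state. The pure-Python sketch below assumes the decoded labels for each candidate number of states are already available (the `labels_by_k` input is hypothetical); scikit-learn's `sklearn.metrics.calinski_harabasz_score(X, labels)` computes the same quantity. Because the score rewards compact, well-separated groups, it does favor states that are far apart, which is the bias the reviewer points out.

```python
def calinski_harabasz(X, labels):
    """Between-cluster over within-cluster dispersion, each scaled by its
    degrees of freedom (higher means tighter, better-separated states)."""
    n, dims = len(X), len(X[0])
    clusters = sorted(set(labels))
    k = len(clusters)
    if not 1 < k < n:
        raise ValueError("need 2 <= number of states < number of samples")
    overall = [sum(x[d] for x in X) / n for d in range(dims)]
    between = within = 0.0
    for c in clusters:
        members = [x for x, lab in zip(X, labels) if lab == c]
        centroid = [sum(x[d] for x in members) / len(members) for d in range(dims)]
        between += len(members) * sum((centroid[d] - overall[d]) ** 2 for d in range(dims))
        within += sum(sum((x[d] - centroid[d]) ** 2 for d in range(dims)) for x in members)
    if within == 0:
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))


def pick_n_states(X, labels_by_k):
    """Choose the candidate K whose decoded labels maximize the score."""
    return max(labels_by_k, key=lambda k: calinski_harabasz(X, labels_by_k[k]))


X = [[0.0], [1.0], [10.0], [11.0]]
print(calinski_harabasz(X, [0, 0, 1, 1]))  # 200.0
print(pick_n_states(X, {2: [0, 0, 1, 1], 3: [0, 1, 2, 2]}))  # 2
```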

      We agree with the reviewer that selecting the number of states is critical for any study that uses an HMM. However, the field lacks a consensus on how to decide on the number of states, and the Calinski-Harabasz score has been validated in previous studies. Most importantly, the latent states' relationships with behavioral and cognitive measures provide strong evidence that the latent states are meaningful. Again, we do not argue that four states will be optimal in every dataset, nor that these four states will always be the optimal ones. Instead, the manuscript proposes that a small number of latent states explains meaningful variance in cognitive dynamics.

      • Grammatical error: P24 L29 rendering seems to have gone wrong

      The intended expression was correct, but it rendered incorrectly. To avoid confusion, we changed “(number of participantsC2 iterations)” to “(NC2 iterations, where N = number of participants)” (page 26 in the manuscript).
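      Here NC2 denotes the binomial coefficient "N choose 2", the number of unordered participant pairs. A minimal Python sketch, assuming the procedure loops once over every unordered pair of participants (the pairwise analysis itself is not shown, and the function name is hypothetical):

```python
import itertools
import math


def participant_pairs(n_participants):
    """Every unordered pair (i, j) with i < j; there are C(N, 2) of them."""
    return itertools.combinations(range(n_participants), 2)


n = 27  # e.g., the SONG sample size mentioned in this letter
pairs = list(participant_pairs(n))
print(len(pairs), math.comb(n, 2))  # both equal 27 * 26 / 2
```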

      Questions:

      • Comment on subject differences, it seems like they potentially found group dynamics based on stimuli, but interesting to see individual differences in large-scale dynamics, and do they believe the states they find mostly explain global linear dynamics?

      We agree with the reviewer that whether low-dimensional latent state dynamics explain individual differences, above and beyond what could be explained by the high-dimensional, temporally static neural signatures of individuals (e.g., Finn et al., 2015), is an important research question. However, because the SONG dataset was collected in a single lab with a focus on covering diverse contexts (rest, task, and movie watching) over two sessions, we were able to collect data from only 27 participants. Given this small sample size, we focused on investigating group-level, shared temporal dynamics and across-condition differences rather than individual differences.

      Past work has studied individual differences (e.g., behavioral traits such as well-being, intelligence, and personality) using the HMM (Vidaurre et al., 2017). In our lab, we are currently working on a project that investigates latent state dynamics in relation to individual differences in clinical symptoms using the Healthy Brain Network dataset (Ji et al., 2022, presented at SfN; Alexander et al., 2017).

      Finally, the reviewer raises an interesting question about whether the latent state sequence derived here mostly explains global linear dynamics as opposed to nonlinear dynamics. We have two responses: one methodological and one theoretical. First, methodologically, we defined the emission probabilities as a linear mixture of Gaussian distributions over the input dimensions, with state-specific means (mean fMRI activity patterns of the networks) and covariances (functional covariance across networks). Therefore, states are modeled under an assumption of linear feature combinations. Second, theoretically, recent work argues in favor of nonlinearity in large-scale neural dynamics, especially as tasks become richer and more complex (Cunningham and Yu, 2014; Gao et al., 2021). However, whether low-dimensional latent states should be modeled nonlinearly (that is, whether linear algorithms are insufficient for capturing latent states compared with nonlinear algorithms) is still unknown. We agree with the reviewer that the assumption of linearity is an interesting topic in systems neuroscience. Nevertheless, given prior work showing that numerous algorithms, linear or nonlinear, recapitulate a common set of latent states, we argue that the HMM provides a strong low-dimensional model of large-scale neural activity and interaction.

      • P19 L40 why did the authors interpolate incorrect or no-responses for the gradCPT runs? It seems more logical to correct their results for these responses or to throw them out since interpolation can induce huge biases in these cases because the data is likely not missing at completely random.

      Interpolating the RTs of trials without responses (omission errors and incorrect trials) is a standard protocol for analyzing gradCPT data (Esterman et al., 2013; Fortenbaugh et al., 2018, 2015; Jayakumar et al., 2023; Rosenberg et al., 2013; Terashima et al., 2021; Yamashita et al., 2021). This choice follows from the assumption that sustained attention is a continuous state: the RT, a proxy for attentional state in the gradCPT literature, is a noisy measure of a smooth, continuous attentional state. Thus, the RTs of trials without responses are interpolated, and the RT time courses are smoothed by convolution with a Gaussian kernel.
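      The two steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact protocol of the cited studies: RTs are indexed by trial, `None` marks a trial without a response, missing values are linearly interpolated from the nearest responded trials (edges copy the nearest value), and the kernel width `sigma` (in trials) is a hypothetical parameter, since the cited papers specify their own kernel widths. Near the edges the truncated kernel is renormalized.

```python
import bisect
import math


def interpolate_missing(rts):
    """Linearly interpolate None entries from neighboring responded trials."""
    known = [i for i, v in enumerate(rts) if v is not None]
    if not known:
        raise ValueError("no responded trials to interpolate from")
    out = list(rts)
    for i, v in enumerate(rts):
        if v is not None:
            continue
        pos = bisect.bisect_left(known, i)
        left = known[pos - 1] if pos > 0 else None
        right = known[pos] if pos < len(known) else None
        if left is None:
            out[i] = rts[right]          # leading gap: copy first response
        elif right is None:
            out[i] = rts[left]           # trailing gap: copy last response
        else:
            w = (i - left) / (right - left)
            out[i] = rts[left] + w * (rts[right] - rts[left])
    return out


def gaussian_smooth(values, sigma):
    """Convolve with a Gaussian kernel truncated at 3 sigma, renormalized at edges."""
    radius = int(math.ceil(3 * sigma))
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    n = len(values)
    out = []
    for i in range(n):
        num = den = 0.0
        for offset, w in zip(offsets, kernel):
            j = i + offset
            if 0 <= j < n:
                num += w * values[j]
                den += w
        out.append(num / den)
    return out


rts = [0.52, None, 0.61, 0.58, None, None, 0.70]  # hypothetical RTs in seconds
print(gaussian_smooth(interpolate_missing(rts), sigma=2.0))
```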

      References

      Abbas A, Belloy M, Kashyap A, Billings J, Nezafati M, Schumacher EH, Keilholz S. 2019. Quasiperiodic patterns contribute to functional connectivity in the brain. Neuroimage 191:193–204.

      Alexander LM, Escalera J, Ai L, Andreotti C, Febre K, Mangone A, Vega-Potler N, Langer N, Alexander A, Kovacs M, Litke S, O’Hagan B, Andersen J, Bronstein B, Bui A, Bushey M, Butler H, Castagna V, Camacho N, Chan E, Citera D, Clucas J, Cohen S, Dufek S, Eaves M, Fradera B, Gardner J, Grant-Villegas N, Green G, Gregory C, Hart E, Harris S, Horton M, Kahn D, Kabotyanski K, Karmel B, Kelly SP, Kleinman K, Koo B, Kramer E, Lennon E, Lord C, Mantello G, Margolis A, Merikangas KR, Milham J, Minniti G, Neuhaus R, Levine A, Osman Y, Parra LC, Pugh KR, Racanello A, Restrepo A, Saltzman T, Septimus B, Tobe R, Waltz R, Williams A, Yeo A, Castellanos FX, Klein A, Paus T, Leventhal BL, Craddock RC, Koplewicz HS, Milham MP. 2017. Data Descriptor: An open resource for transdiagnostic research in pediatric mental health and learning disorders. Sci Data 4:1–26.

      Allen EA, Damaraju E, Plis SM, Erhardt EB, Eichele T, Calhoun VD. 2014. Tracking whole-brain connectivity dynamics in the resting state. Cereb Cortex 24:663–676.

      Baker AP, Brookes MJ, Rezek IA, Smith SM, Behrens T, Probert Smith PJ, Woolrich M. 2014. Fast transient networks in spontaneous human brain activity. Elife 3:e01867.

      Bolt T, Nomi JS, Bzdok D, Salas JA, Chang C, Yeo BTT, Uddin LQ, Keilholz SD. 2022. A Parsimonious Description of Global Functional Brain Organization in Three Spatiotemporal Patterns. Nat Neurosci 25:1093–1103.

      Brown JA, Lee AJ, Pasquini L, Seeley WW. 2021. A dynamic gradient architecture generates brain activity states. Neuroimage 261:119526.

      Chang C, Leopold DA, Schölvinck ML, Mandelkow H, Picchioni D, Liu X, Ye FQ, Turchi JN, Duyn JH. 2016. Tracking brain arousal fluctuations with fMRI. Proc Natl Acad Sci U S A 113:4518–4523.

      Chang CHC, Lazaridi C, Yeshurun Y, Norman KA, Hasson U. 2021. Relating the past with the present: Information integration and segregation during ongoing narrative processing. J Cogn Neurosci 33:1–23.

      Chang LJ, Jolly E, Cheong JH, Rapuano K, Greenstein N, Chen P-HA, Manning JR. 2021. Endogenous variation in ventromedial prefrontal cortex state dynamics during naturalistic viewing reflects affective experience. Sci Adv 7:eabf7129.

      Chen J, Leong YC, Honey CJ, Yong CH, Norman KA, Hasson U. 2017. Shared memories reveal shared structure in neural activity across individuals. Nat Neurosci 20:115–125.

      Chen S, Langley J, Chen X, Hu X. 2016. Spatiotemporal Modeling of Brain Dynamics Using Resting-State Functional Magnetic Resonance Imaging with Gaussian Hidden Markov Model. Brain Connect 6:326–334.

      Cocchi L, Gollo LL, Zalesky A, Breakspear M. 2017. Criticality in the brain: A synthesis of neurobiology, models and cognition. Prog Neurobiol 158:132–152.

      Cornblath EJ, Ashourvan A, Kim JZ, Betzel RF, Ciric R, Adebimpe A, Baum GL, He X, Ruparel K, Moore TM, Gur RC, Gur RE, Shinohara RT, Roalf DR, Satterthwaite TD, Bassett DS. 2020. Temporal sequences of brain activity at rest are constrained by white matter structure and modulated by cognitive demands. Commun Biol 3:261.

      Cunningham JP, Yu BM. 2014. Dimensionality reduction for large-scale neural recordings. Nat Neurosci 17:1500–1509.

      Deco G, Kringelbach ML, Jirsa VK, Ritter P. 2017. The dynamics of resting fluctuations in the brain: Metastability and its dynamical cortical core. Sci Rep 7:3095.

      Esterman M, Noonan SK, Rosenberg M, Degutis J. 2013. In the zone or zoning out? Tracking behavioral and neural fluctuations during sustained attention. Cereb Cortex 23:2712–2723.

      Esterman M, Rothlein D. 2019. Models of sustained attention. Curr Opin Psychol 29:174–180.

      Fagerholm ED, Lorenz R, Scott G, Dinov M, Hellyer PJ, Mirzaei N, Leeson C, Carmichael DW, Sharp DJ, Shew WL, Leech R. 2015. Cascades and cognitive state: Focused attention incurs subcritical dynamics. J Neurosci 35:4626–4634.

      Falahpour M, Chang C, Wong CW, Liu TT. 2018. Template-based prediction of vigilance fluctuations in resting-state fMRI. Neuroimage 174:317–327.

      Finn ES, Shen X, Scheinost D, Rosenberg MD, Huang J, Chun MM, Papademetris X, Constable RT. 2015. Functional connectome fingerprinting: Identifying individuals using patterns of brain connectivity. Nat Neurosci 18:1664–1671.

      Fortenbaugh FC, Degutis J, Germine L, Wilmer JB, Grosso M, Russo K, Esterman M. 2015. Sustained attention across the life span in a sample of 10,000: Dissociating ability and strategy. Psychol Sci 26:1497–1510.

      Fortenbaugh FC, Rothlein D, McGlinchey R, DeGutis J, Esterman M. 2018. Tracking behavioral and neural fluctuations during sustained attention: A robust replication and extension. Neuroimage 171:148–164.

      Fox MD, Snyder AZ, Vincent JL, Corbetta M, Van Essen DC, Raichle ME. 2005. The human brain is intrinsically organized into dynamic, anticorrelated functional networks. Proc Natl Acad Sci U S A 102:9673–9678.

      Gao S, Mishne G, Scheinost D. 2021. Nonlinear manifold learning in functional magnetic resonance imaging uncovers a low-dimensional space of brain dynamics. Hum Brain Mapp 42:4510–4524.

      Goodale SE, Ahmed N, Zhao C, de Zwart JA, Özbay PS, Picchioni D, Duyn J, Englot DJ, Morgan VL, Chang C. 2021. fMRI-based detection of alertness predicts behavioral response variability. Elife 10:1–20.

      Greene AS, Horien C, Barson D, Scheinost D, Constable RT. 2023. Why is everyone talking about brain state? Trends Neurosci.

      Greene DJ, Marek S, Gordon EM, Siegel JS, Gratton C, Laumann TO, Gilmore AW, Berg JJ, Nguyen AL, Dierker D, Van AN, Ortega M, Newbold DJ, Hampton JM, Nielsen AN, McDermott KB, Roland JL, Norris SA, Nelson SM, Snyder AZ, Schlaggar BL, Petersen SE, Dosenbach NUF. 2020. Integrative and Network-Specific Connectivity of the Basal Ganglia and Thalamus Defined in Individuals. Neuron 105:742-758.e6.

      Gu S, Pasqualetti F, Cieslak M, Telesford QK, Yu AB, Kahn AE, Medaglia JD, Vettel JM, Miller MB, Grafton ST, Bassett DS. 2015. Controllability of structural brain networks. Nat Commun 6:8414.

      Jayakumar M, Balusu C, Aly M. 2023. Attentional fluctuations and the temporal organization of memory. Cognition 235:105408.

      Ji E, Lee JE, Hong SJ, Shim W. 2022. Idiosyncrasy of latent neural state dynamic in ASD during movie watching. Poster presented at the Society for Neuroscience 2022 Annual Meeting.

      Karapanagiotidis T, Vidaurre D, Quinn AJ, Vatansever D, Poerio GL, Turnbull A, Ho NSP, Leech R, Bernhardt BC, Jefferies E, Margulies DS, Nichols TE, Woolrich MW, Smallwood J. 2020. The psychological correlates of distinct neural states occurring during wakeful rest. Sci Rep 10:1–11.

      Liu X, Duyn JH. 2013. Time-varying functional network information extracted from brief instances of spontaneous brain activity. Proc Natl Acad Sci U S A 110:4392–4397.

      Liu X, Zhang N, Chang C, Duyn JH. 2018. Co-activation patterns in resting-state fMRI signals. Neuroimage 180:485–494.

      Lynn CW, Cornblath EJ, Papadopoulos L, Bertolero MA, Bassett DS. 2021. Broken detailed balance and entropy production in the human brain. Proc Natl Acad Sci 118:e2109889118.

      Margulies DS, Ghosh SS, Goulas A, Falkiewicz M, Huntenburg JM, Langs G, Bezgin G, Eickhoff SB, Castellanos FX, Petrides M, Jefferies E, Smallwood J. 2016. Situating the default-mode network along a principal gradient of macroscale cortical organization. Proc Natl Acad Sci U S A 113:12574–12579.

      Mesulam MM. 1998. From sensation to cognition. Brain 121:1013–1052.

      Munn BR, Müller EJ, Wainstein G, Shine JM. 2021. The ascending arousal system shapes neural dynamics to mediate awareness of cognitive states. Nat Commun 12:1–9.

      Raut RV, Snyder AZ, Mitra A, Yellin D, Fujii N, Malach R, Raichle ME. 2021. Global waves synchronize the brain’s functional systems with fluctuating arousal. Sci Adv 7.

      Rosenberg M, Noonan S, DeGutis J, Esterman M. 2013. Sustaining visual attention in the face of distraction: A novel gradual-onset continuous performance task. Attention, Perception, Psychophys 75:426–439.

      Rosenberg MD, Finn ES, Scheinost D, Papademetris X, Shen X, Constable RT, Chun MM. 2016. A neuromarker of sustained attention from whole-brain functional connectivity. Nat Neurosci 19:165–171.

      Rosenberg MD, Scheinost D, Greene AS, Avery EW, Kwon YH, Finn ES, Ramani R, Qiu M, Todd Constable R, Chun MM. 2020. Functional connectivity predicts changes in attention observed across minutes, days, and months. Proc Natl Acad Sci U S A 117:3797–3807.

      Saggar M, Shine JM, Liégeois R, Dosenbach NUF, Fair D. 2022. Precision dynamical mapping using topological data analysis reveals a hub-like transition state at rest. Nat Commun 13.

      Schaefer A, Kong R, Gordon EM, Laumann TO, Zuo X-N, Holmes AJ, Eickhoff SB, Yeo BTT. 2018. Local-Global Parcellation of the Human Cerebral Cortex from Intrinsic Functional Connectivity MRI. Cereb Cortex 28:3095–3114.

      Shine JM. 2019. Neuromodulatory Influences on Integration and Segregation in the Brain. Trends Cogn Sci 23:572–583.

      Shine JM, Bissett PG, Bell PT, Koyejo O, Balsters JH, Gorgolewski KJ, Moodie CA, Poldrack RA. 2016. The Dynamics of Functional Brain Networks: Integrated Network States during Cognitive Task Performance. Neuron 92:544–554.

      Shine JM, Breakspear M, Bell PT, Ehgoetz Martens K, Shine R, Koyejo O, Sporns O, Poldrack RA. 2019. Human cognition involves the dynamic integration of neural activity and neuromodulatory systems. Nat Neurosci 22:289–296.

      Smith SM, Fox PT, Miller KL, Glahn DC, Fox PM, Mackay CE, Filippini N, Watkins KE, Toro R, Laird AR, Beckmann CF. 2009. Correspondence of the brain’s functional architecture during activation and rest. Proc Natl Acad Sci 106:13040–13045.

      Song H, Finn ES, Rosenberg MD. 2021a. Neural signatures of attentional engagement during narratives and its consequences for event memory. Proc Natl Acad Sci 118:e2021905118.

      Song H, Park B-Y, Park H, Shim WM. 2021b. Cognitive and Neural State Dynamics of Narrative Comprehension. J Neurosci 41:8972–8990.

      Taghia J, Cai W, Ryali S, Kochalka J, Nicholas J, Chen T, Menon V. 2018. Uncovering hidden brain state dynamics that regulate performance and decision-making during cognition. Nat Commun 9:2505.

      Terashima H, Kihara K, Kawahara JI, Kondo HM. 2021. Common principles underlie the fluctuation of auditory and visual sustained attention. Q J Exp Psychol 74:705–715.

      Tian Y, Margulies DS, Breakspear M, Zalesky A. 2020. Topographic organization of the human subcortex unveiled with functional connectivity gradients. Nat Neurosci 23:1421–1432.

      Turnbull A, Karapanagiotidis T, Wang HT, Bernhardt BC, Leech R, Margulies D, Schooler J, Jefferies E, Smallwood J. 2020. Reductions in task positive neural systems occur with the passage of time and are associated with changes in ongoing thought. Sci Rep 10:1–10.

      Unsworth N, Robison MK. 2018. Tracking arousal state and mind wandering with pupillometry. Cogn Affect Behav Neurosci 18:638–664.

      Unsworth N, Robison MK. 2016. Pupillary correlates of lapses of sustained attention. Cogn Affect Behav Neurosci 16:601–615.

      van der Meer JN, Breakspear M, Chang LJ, Sonkusare S, Cocchi L. 2020. Movie viewing elicits rich and reliable brain state dynamics. Nat Commun 11:1–14.

      Van Essen DC, Smith SM, Barch DM, Behrens TEJ, Yacoub E, Ugurbil K. 2013. The WU-Minn Human Connectome Project: An overview. Neuroimage 80:62–79.

      Vidaurre D, Abeysuriya R, Becker R, Quinn AJ, Alfaro-Almagro F, Smith SM, Woolrich MW. 2018. Discovering dynamic brain networks from big data in rest and task. Neuroimage, Brain Connectivity Dynamics 180:646–656.

      Vidaurre D, Smith SM, Woolrich MW. 2017. Brain network dynamics are hierarchically organized in time. Proc Natl Acad Sci U S A 114:12827–12832.

      Yamashita A, Rothlein D, Kucyi A, Valera EM, Esterman M. 2021. Brain state-based detection of attentional fluctuations and their modulation. Neuroimage 236:118072.

      Yeo BTT, Krienen FM, Sepulcre J, Sabuncu MR, Lashkari D, Hollinshead M, Roffman JL, Smoller JW, Zöllei L, Polimeni JR, Fisch B, Liu H, Buckner RL. 2011. The organization of the human cerebral cortex estimated by intrinsic functional connectivity. J Neurophysiol 106:1125–1165.

      Yousefi B, Keilholz S. 2021. Propagating patterns of intrinsic activity along macroscale gradients coordinate functional connections across the whole brain. Neuroimage 231:117827.

      Zhang S, Goodale SE, Gold BP, Morgan VL, Englot DJ, Chang C. 2023. Vigilance associates with the low-dimensional structure of fMRI data. Neuroimage 267.

    1. Author Response

      Reviewer #2 (Public Review):

      "The cellular architecture of memory modules in Drosophila supports stochastic input integration" is a classical biophysical compartmental modelling study. It takes advantage of some simple current injection protocols in a massively complex mushroom body neuron called MBON-a3 and compartmental models that simulate the electrophysiological behaviour given a detailed description of the anatomical extent of its neurites.

      This work is interesting in a number of ways:

      • The input structure information comes from EM data (Kenyon cells), although this is not discussed much in the paper.

      • The paper predicts a potentially novel normalization of the throughput of KC inputs at the level of the proximal dendrite and soma.

      • It claims a new computational principle in dendrites; this did not become very clear to me.

      Problems I see:

      • The current injections did not last long enough to reach steady state (e.g. Figure 1FG), and the model current injection traces have two time constants but the data only one (Figure 2DF). This does not make me very confident in the results and conclusions.

      These are two important but separate questions that we would like to address in turn.

      As for the first, in our new recordings using cytoplasmic GFP to identify MBON-alpha3, we performed both 200 ms current injections and prolonged 400 ms injections to reach steady state (for all four new cells, 1’-4’). For comparison with the original dataset, we mainly present the raw traces of the 200 ms recordings in Figure 1 Supplement 2. In addition, we now provide a direct comparison of these recordings (200 ms versus 400 ms) and did not observe significant differences in tau between them (Figure 1 Supplement 2 K). This comparison shows that the 200 ms current injection reaches a maximum voltage deflection close to the steady-state level of the prolonged protocol. Importantly, the critical parameter (tau) did not change between these datasets.
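      As a back-of-the-envelope check, assume (as an idealization) that the membrane charges as a single exponential with τ ≈ 16 ms, the value quoted from Table 1 in this letter. Then 200 ms is 12.5 time constants, and the remaining gap to steady state is e^(−12.5), about 4 × 10⁻⁶ of the final deflection. The fast somatic component discussed below makes the real response deviate from this idealization, so the sketch (with a hypothetical function name) only bounds the slow charging phase.

```python
import math


def fraction_of_steady_state(t_ms, tau_ms):
    """Fraction of the final deflection reached after t_ms for single-exponential charging."""
    return 1.0 - math.exp(-t_ms / tau_ms)


tau = 16.0  # ms, value quoted from Table 1 in this letter
for t in (16.0, 50.0, 200.0, 400.0):
    print(f"{t:5.0f} ms: {fraction_of_steady_state(t, tau):.6f}")
```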

      Regarding the second question, the two different time constants, we thank the reviewer for pointing this out. Indeed, while the simulated voltage follows an approximately exponential decay whose time constant is, by design, essentially identical to the measured value (τ ≈ 16 ms, from Table 1; see Figure 1 Supplement 2 for details), the voltage decays and rises much faster immediately after the onset and offset of the current injections. We believe this is due to the morphology of this neuron. Current is injected, and voltage recorded, at the soma, which is connected to the remainder of the neuron by a long, thin neurite. This ‘remainder’ is, of course, much larger than the soma in linear size, volume and surface (membrane) area; see Fig. 2A. As a result, a current injection first quickly charges the membrane of the soma, producing the initial fast voltage changes seen in Fig. 2D,F, before the membrane in the remainder of the cell is charged with the cell’s time constant τ.

      We confirmed this intuition by running various simplified simulations in NEURON, which indeed show a much more rapid voltage change immediately after a step change in injected current than over the longer term. In fact, we found that the pattern appears even in the simplest possible two-compartment version of the neuron’s equivalent circuit, which we solved in an all-purpose numerical simulator of electrical circuitry (https://www.falstad.com/circuit). The circuit is shown in Figure 1. We chose rather generic values for the circuit components, with the constraints that the cell capacitance, chosen as 15 pF, and membrane resistance, chosen as 1 GΩ, are in the range of the observed data (as is, consequently, its time constant, which is 15 ms with these choices); see Table 1 of the manuscript. We chose the capacitance of the soma as 1.5 pF, making the time constant of the soma (1.5 ms) an order of magnitude shorter than that of the cell.

      Figure 1: Simplified circuit of a small soma (left parallel RC circuit) and the much larger remainder of a cell (right parallel RC circuit) connected by a neurite (right 100MΩ resistor). A current source (far left) injects constant current into the soma through the left 100MΩ resistor.

      Figure 2 shows the somatic voltage in this circuit (i.e., at the upper terminal of the 1.5 pF capacitor) while a -10 pA current is injected for about 4.5 ms, after which the current is set back to zero. The combination of an initial rapid change followed by a gradual change with a time constant of ≈15 ms is visible at both onset and offset of the current injection. Figure 3 shows the voltage traces plotted for a duration of approximately one time constant, and Fig. 4 shows the detailed shape right after current onset.

      Figure 2: Somatic voltage in the circuit in Fig. 1 with current injection for about 4.5ms, followed by zero current injection for another ≈ 3.5ms.

      Figure 3: Somatic voltage in the circuit, as in Fig. 2 but with current injected for approx. 15 ms.

      While we did not try to quantitatively assess the deviation from a single-exponential shape of the voltage in Fig. 2E, a more rapid change at the onset and offset of the current injection is clearly visible in this figure. This deviation from a single exponential is smaller than what we see in the simulation (both in Fig 2D of the manuscript and in the results of the simplified circuit here in the rebuttal). We believe that the effect is smaller in Fig. 2E because it shows the average over many traces; it is much more visible in the ’raw’ (not averaged) traces. Two randomly selected traces from the first of the recorded neurons are shown in Figure 2 Supplement 2 C. Although the non-averaged traces are affected by artifacts and noise, the rapid voltage changes are visible at essentially all onsets and offsets of the current injection.

      Figure 4: Somatic voltage in the circuit, as in Fig. 2 but showing only the time right after current onset (about 2.3 ms).

      We have added a short discussion at the end of Section 2.3 that briefly points out this observation and its explanation. There, we also refer to the simplified circuit simulation and to the comparison with raw voltage traces, which are now shown in the new Figure 2 Supplement 2.

      • The time constant in Table 1 is much shorter than in Figure 1FG?

      No, these values are in agreement. To facilitate the comparison we now include a graphical measurement of tau from our traces in Figure 1 Supplement 2 J.

      • Related to this, the capacitance values are very low maybe this can be explained by the model’s wrong assumption of tau?

      Indeed, the measured time constants are somewhat lower than what might be expected. We believe that this is because after a step change of the injected current, an initial rapid voltage change occurs in the soma, where the recordings are taken. The measured time constant is a combination of the ’actual’ time constant of the cell and the ’somatic’ (very short) time constant of the soma. Please see our explanations above.
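      The argument that a fast somatic component shortens the apparent time constant can be checked numerically. The sketch below assumes a hypothetical normalized step response that is the sum of a slow (16 ms) and a fast (1.5 ms) exponential with an arbitrarily chosen 85/15 weighting, and estimates τ graphically as the time needed to reach 1 - 1/e (about 63%) of the final value. Because most of the fast component has already completed by t = τ_slow, the 63% point is always reached before τ_slow. The weights and function names are illustrative only.

```python
import math


def double_exp_response(t, tau_slow=16.0, tau_fast=1.5, w_slow=0.85):
    """Normalized step response (0 -> 1) with a slow and a fast exponential component."""
    return 1.0 - w_slow * math.exp(-t / tau_slow) - (1.0 - w_slow) * math.exp(-t / tau_fast)


def tau_63(times, values):
    """Graphical time-constant estimate: first time the response reaches 1 - 1/e of its final value."""
    target = (1.0 - math.exp(-1.0)) * values[-1]
    for t, v in zip(times, values):
        if v >= target:
            return t
    raise ValueError("response never reached 63% of its final value")


dt = 0.01  # ms
times = [k * dt for k in range(20001)]  # 0-200 ms, long enough to settle
values = [double_exp_response(t) for t in times]
apparent_tau = tau_63(times, values)
print(f"apparent tau = {apparent_tau:.1f} ms (slow component: 16.0 ms)")
```

      Applied to a pure single exponential, the same estimator returns its time constant; the shortfall for the two-component response is entirely due to the fast somatic component.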

      Importantly, the value for tau from Table 1 is not used explicitly in the model as the parameters used in our simulation are determined by optimal fits of the simulated voltage curves to experimentally obtained data.

      • That latter in turn could be because of either space clamp issues in this hugely complex cell or bad model predictions due to incomplete reconstructions, bad match between morphology and electrophysiology (both are from different datasets?), or unknown ion channels that produce non-linear behaviour during the current injections.

      Please see our detailed discussion above. Furthermore, we now provide additional recordings using cytoplasmic GFP as a marker for the identification of MBON-alpha3, which confirm our findings. We agree that space-clamp issues could interfere with our recordings in such a complex cell. However, an approach based on electrophysiological data should still be superior to the alternative of picking textbook values. Because we injected negative currents for our analysis, voltage-gated ion channels, at least, should not influence our recordings.

      • The PRAXIS method in NEURON seems too ad hoc. Passive properties of a neuron should probably rather be explored in parameter scans.

      We are somewhat at a loss as to what is meant by the PRAXIS method being "too ad hoc." The PRAXIS method is essentially a derivative-free conjugate-direction optimization algorithm (a variant of Powell's method; since no explicit derivatives are available, it assumes that the objective function is approximately quadratic). This seems to us a systematic way of doing a parameter scan, and the procedure has been used in other related models, e.g. the cited Gouwens & Wilson (2009) study.
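      Whether the minimum is located by a grid scan or by a derivative-free optimizer such as PRAXIS, both minimize the same objective: the squared difference between simulated and recorded voltages. The sketch below illustrates this for a deliberately simplified, hypothetical single-compartment model, V(t) = I R (1 - exp(-t / (R C))), fitted to a synthetic trace by an exhaustive scan over R and C. It is not the NEURON/PRAXIS setup used in the study; a conjugate-direction optimizer would typically reach the same minimum with far fewer evaluations of the objective.

```python
import math


def rc_step(t, r_gohm, c_pf, i_pa=-10.0):
    """Single-compartment step response in mV (pA * GOhm = mV, pF * GOhm = ms)."""
    return i_pa * r_gohm * (1.0 - math.exp(-t / (r_gohm * c_pf)))


def sse(params, times, recorded):
    """Objective: sum of squared errors between model and recorded voltages."""
    r, c = params
    return sum((rc_step(t, r, c) - v) ** 2 for t, v in zip(times, recorded))


def grid_scan(times, recorded, r_values, c_values):
    """Exhaustive scan over (R, C); returns the best pair and its error."""
    best = None
    for r in r_values:
        for c in c_values:
            err = sse((r, c), times, recorded)
            if best is None or err < best[1]:
                best = ((r, c), err)
    return best


times = [0.5 * k for k in range(121)]              # 0-60 ms
recorded = [rc_step(t, 1.0, 15.0) for t in times]  # synthetic "data", not measurements
r_grid = [0.5 + 0.1 * i for i in range(11)]        # 0.5-1.5 GOhm
c_grid = [5.0 + 1.0 * j for j in range(21)]        # 5-25 pF
(best_r, best_c), best_err = grid_scan(times, recorded, r_grid, c_grid)
```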

      Questions I have:

      • Computational aspects were previously addressed by e.g. Larry Abbott and Gilles Laurent (sparse coding), how do the findings here distinguish themselves from this work

      In contrast to the work by Abbott and Laurent, which addressed the general relevance and suitability of sparse and random coding for the encoding of sensory information in decision making, here we address the cellular and computational role that an individual node (KC>MBON) plays within the circuitry. Because we use functionally and morphologically relevant data, this study builds upon the prior work but significantly extends the general models to a specific case. We think this is essential for the further exploration of the topic.

      • What is valence information?

      Valence information is the information whether a stimulus is good or bad: positive valence (e.g. sugar in appetitive memory paradigms) or negative valence (e.g. the electric shock in aversive olfactory conditioning). Valence information is provided by the dopaminergic system. Dopaminergic neurons are in direct contact with the KC>MBON circuitry and modify its synaptic connectivity when olfactory information is paired with a positive or negative stimulus.

      • It seems that Martin Nawrot’s work would be relevant to this work

      We are aware of the work by the Nawrot group that provided important insights into the processing of information within the olfactory mushroom body circuitry. We now highlight some of his work. His recent work will certainly be relevant for our future studies when we try to extend our work from an individual cell to networks.

      • Compactification and democratization could be related to other work like Otopalik et al 2017 eLife but also passive normalization. The equal efficiency in line 427 reminds me of dendritic/synaptic democracy and dendritic constancy

      Many thanks for pointing this out. This is in line with the comments from reviewer 1 and we now highlight these papers in the relevant paragraph in the discussion (line 442ff).

      • The morphology does not obviously seem compact, how unusual would it be that such a complex dendrite is so compact?

      We should have been more careful in our terminology, making clear that when we write ’compact’ we always mean ’electrotonically compact’, in the sense that the physical dimensions of the neuron are small compared to its characteristic length constant (usually called λ). The degree to which a dendritic structure is electrotonically compact is determined by the interaction of morphology, size and conductances (across the membrane and along the neurites). We do not believe that any one of these factors alone (e.g. morphology) is sufficient to characterize the electrical properties of a dendritic tree. We have now clarified this in the relevant section.
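      For a cylindrical neurite, classical cable theory gives the length constant λ = sqrt(R_m d / (4 R_i)), where R_m is the specific membrane resistance, R_i the axial resistivity and d the diameter; a branch of physical length l has electrotonic length L = l / λ, and L much smaller than 1 indicates electrotonic compactness. The sketch below evaluates these two quantities. The parameter values in the example are hypothetical round numbers, not the fitted values of our model.

```python
import math


def length_constant_um(r_m_ohm_cm2, r_i_ohm_cm, diameter_um):
    """Cable length constant lambda in micrometres for a cylinder of diameter d.

    lambda = sqrt(R_m * d / (4 * R_i)), with R_m in Ohm*cm^2, R_i in Ohm*cm and d in cm.
    """
    d_cm = diameter_um * 1e-4
    lam_cm = math.sqrt(r_m_ohm_cm2 * d_cm / (4.0 * r_i_ohm_cm))
    return lam_cm * 1e4


def electrotonic_length(length_um, lam_um):
    """Dimensionless electrotonic length L = l / lambda."""
    return length_um / lam_um


# Hypothetical values for illustration only.
lam = length_constant_um(r_m_ohm_cm2=20000.0, r_i_ohm_cm=100.0, diameter_um=1.0)
L = electrotonic_length(length_um=100.0, lam_um=lam)
print(f"lambda = {lam:.0f} um, L = {L:.2f}")
```

      Because λ grows with sqrt(R_m d / R_i), the same branch can be compact or not depending on conductances as much as on its shape, which is the point made above.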

      • What were the advantages of using the EM circuit?

      The purpose of our study is to provide a "realistic" model of a KC>MBON node within the memory circuitry. We started our simulations with random synaptic locations but wondered whether such a stochastic model is correct, or whether taking into account the detailed locations and numbers of synaptic connections of individual KCs would make a difference to the computation. We therefore repeated the simulations using the EM data. We now address the comparison between random and realistic synaptic connectivity in Figure 4F. We do not observe a significant difference, but this may become more relevant in future studies when we compute the interplay between MBONs activated by overlapping sets of KCs. We simply think that using the EM data brings us one step closer to realistic models.

      • Isn’t Fig 4E rather trivial if the cell is compact?

      We believe this figure is a visually striking illustration that shows how electrotonically compact the cell is. Such a finding may be trivial in retrospect, once the data is visualized, but we believe it provides a very intuitive description of the cell behavior.

      Overall, I am worried that the passive modelling study of the MBON-a3 does not provide enough evidence to explain the electrophysiological behaviour of the cell and to make accurate predictions of the cell’s responses to a variety of stochastic KC inputs.

      In our view, our model adequately describes the behavior of the MBON with the most minimal (passive) model. Our approach tries to make the fewest assumptions about the electrophysiological properties of the cell. We think that, based on current knowledge, our approach is the best possible one, as thus far no active components within the dendritic or axonal compartments of Drosophila MBONs have been described. As such, our model reflects the current state of knowledge and explains the behavior of the cell very well. We aim to refine this model in the future if experimental evidence requires such adaptations.

      Reviewer #3 (Public Review):

      This manuscript presents an analysis of the cellular integration properties of a specific mushroom body output neuron, MBON-α3, using a combination of patch clamp recordings and data from electron microscopy. The study demonstrates that the neuron is electrotonically compact permitting linear integration of synaptic input from Kenyon cells that represent odor identity.

      Strengths of the manuscript:

      1) The study integrates morphological data about MBON-α3 along with parameters derived from electrophysiological measurements to build a detailed model. 2) The modeling provides support for existing models of how olfactory memory is related to integration at the MBON.

      Weaknesses of the manuscript:

      The study does not provide experimental validation of the results of the computational model.

      The goal of our study is to use computational approaches to provide insights into the computation of the MBON as part of the olfactory memory circuitry. Our data are in agreement with the current model of the circuitry. Our study therefore forms the basis for future experimental studies; however, those would go beyond the scope of the current work.

      The conclusion of the modeling analysis is that the neuron integrates synaptic inputs almost completely linearly. All the subsequent analyses are straightforward consequences of this result.

      We do, indeed, find that synaptic integration in this neuron is almost completely linear. We demonstrate that this result holds in a variety of different ways; all analyses in the study serve this purpose. These results are in line with the findings of Hige and Turner (2013), who demonstrated that synaptic integration at PN>KC synapses is also highly linear. Our data thus point to conservation of this feature at the next node of this circuit.

      The manuscript does not provide much explanation or intuition as to why this linear conclusion holds.

      We respectfully disagree. We demonstrate that this linear integration results from the combination of the cell's size and its biophysical parameters, mainly the conductances across and along the neurites. As to why it holds, our main argument is that results based on the linear model agree with all empirical results known to us, and that this is the simplest model.

      In general, there is a clear takeaway here, which is that the dendritic tree of MBON-α3 in the lobes is highly electrotonically compact. The authors did not provide much explanation as to why this is, and the paper would benefit from a clearer conclusion. Furthermore, I found the results of Figures 4 and 5 rather straightforward given this previous observation. I am sceptical about whether the tiny variations in, e.g. Figs. 3I and 5F-H, are meaningful biologically.

      Please see the comment above as to the ’why’ we believe the neuron is electrotonically compact: a model with this assumption agrees well with empirically found results.

      We agree that the small variations in Fig 5F-H are likely not biologically meaningful. We now state this more clearly in the figure legends and in the text. This result is nevertheless important to show. It is precisely because these variations are small compared to the voltage differences between different numbers of activated KCs (Fig 5D) or different levels of activated synapses (Fig 5E) that we can conclude that a 25% change in either synaptic strength or number can represent clearly distinguishable internal states, and that both changes have the same effect. It is important to show these data to allow the reader to compare the differences that DO matter (Fig 5D,E) with those that DON’T (Fig 5F-H).
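      The equivalence of a 25% change in synapse number and in synaptic strength follows directly from linear summation. The sketch below assumes an idealized linear model in which the somatic response is the plain sum of per-synapse somatic amplitudes; the amplitudes (near-uniform around a hypothetical 20 µV, as one would expect in an electrotonically compact cell) are invented placeholders, and the model ignores any nonlinearity.

```python
import random


def somatic_sum(amplitudes_uv):
    """Linear summation: the somatic response is the plain sum of per-synapse responses."""
    return sum(amplitudes_uv)


rng = random.Random(0)
# Hypothetical per-synapse somatic amplitudes (microvolts), nearly uniform.
amplitudes = [rng.uniform(19.8, 20.2) for _ in range(200)]

baseline = somatic_sum(amplitudes[:100])                   # 100 active synapses
more_synapses = somatic_sum(amplitudes[:125])              # 25% more synapses
stronger_synapses = 1.25 * somatic_sum(amplitudes[:100])   # 25% stronger synapses
```

      The two manipulations differ only by the small spread in per-synapse amplitudes, which is much smaller than their common difference from baseline.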

      The same applies to Fig 3I. The reviewer is entirely correct: the differences in the somatic voltage shown in Figure 3I are minuscule, less than a microvolt, and it is very unlikely that these differences have any biological meaning. The point of this figure is exactly to show this! It demonstrates quantitatively the transformation of the large voltage differences in the dendritic tree into a nearly uniform voltage at the soma. We feel that this shows very clearly the extreme "democratization" of the synaptic input!

    1. Author Response

      Reviewer #1 (Public Review):

      Nicotine preference is highly variable between individuals. The paper by Mondoloni et al. provided some insight into the potential link between IPN nAChR heterogeneity and male nicotine preference behavior. They scored mice using the amount of nicotine consumed, as well as the mice's preference for the drug in a two-bottle choice experiment. An interesting heterogeneity in nicotine-drinking profiles was observed in adult male mice, with about half of the mice ceasing nicotine consumption at high concentrations. They observed a negative association of nicotine intake with nicotine-evoked currents in the interpeduncular nucleus (IPN). They also identified beta4-containing nicotinic acetylcholine receptors, which exhibit an association with nicotine aversion. The behavioral differentiation of Av vs. N-Av mice and the identification of IPN variability, both in behavioral and electrophysiological aspects, add an important candidate for analyzing individual behavior in addiction.

      The native existence of beta4-nAchR heterogeneity is an important premise that supports the molecules to be the candidate substrate of variabilities. However, only knockout and re-expression models were used, which is insufficient to mimic the physiological state that leads to variability in nicotine preference.

      We’d like to thank reviewer 1 for his/her positive remarks and for suggesting important control experiments. Regarding the reviewer’s latest comment on the link between b4 and variability, we would like to point out that the experiment in which mice were put under chronic nicotine can be seen as another way to manipulate the physiological state of the animal. Indeed, we found that chronic nicotine downregulates b4 nAChR expression levels (but has no effect on residual nAChR currents in b4-/- mice) and reduces nicotine aversion. Therefore, these results also point toward a role of IPN b4 nAChRs in nicotine aversion. We have now performed additional experiments and analyses to address these concerns and to reinforce our demonstration.

      Reviewer #2 (Public Review):

      In the current study, Mondoloni and colleagues investigate the neural correlates contributing to nicotine aversion and its alteration following chronic nicotine exposure. The question asked is important to the field of individual vulnerability to drug addiction and has translational significance. First, the authors identify individual nicotine consumption profiles across isogenic mice. Further, they employed in vivo and ex vivo physiological approaches to define how the response of interpeduncular nucleus (IPn) neurons to nicotine is associated with nicotine avoidance. Additionally, the authors determine that chronic nicotine exposure impairs the normal response of IPn neurons to nicotine, thus contributing to higher amounts of nicotine consumption. Finally, they used transgenic and viral-mediated gene expression approaches to establish a causal link between b4 nicotinic receptor function and nicotine avoidance processes.

      The manuscript and experimental strategy are well designed and executed; the current dataset requires supplemental analyses and details to exclude possible alternatives. Overall, the results are exciting and provide helpful information to the field of drug addiction research, individual vulnerability to drug addiction, and neuronal physiology. Below are some comments aiming to help the authors improve this interesting study.

      We would like to thank the reviewer for his/her positive remarks and we hope the new version of the manuscript will clarify his/her concerns.

      1) The authors used a two-bottle choice behavioral paradigm to investigate the neurophysiological substrate contributing to nicotine avoidance behaviors. While the data set supporting the author's interpretation is compelling and the experiments are well-conducted, a few supplemental control analyses will strengthen the current manuscript.

      a) The bitter taste of nicotine might generate confounds in the data interpretation: are the mice avoiding the bitterness or the nicotine-induced physiological effect? To address this question, the authors mixed nicotine with saccharine, thus covering the bitterness of nicotine. Additionally, the authors show that all the mice exposed to quinine avoid it, and in comparison, the N-Av don't avoid the bitterness of the nicotine-saccharine solution. Yet it is unclear if Av and N-Av have different taste discrimination capacities and if such taste discrimination capacities drive the N-Av to consume less nicotine. Would Av and N-Av mice avoid quinine differently after the 20-day nicotine paradigm? Would the authors observe individual nicotine drinking behaviors if nicotine/quinine vs. quinine were offered to the mice?

      As requested by all three reviewers, we have now performed a two-bottle choice experiment to verify whether different sensitivities to the bitterness of the nicotine solution could explain the different sensitivities to the aversive properties of nicotine. Indeed, even though we used saccharine to mask the bitterness of the nicotine solution, we cannot fully exclude the possibility that the taste capacity of the mice could affect their nicotine consumption. Reviewers 1 and 2 suggested performing nicotine/quinine versus quinine preference tests, but we were concerned that forcing mice to drink an aversive, quinine-containing solution might affect the total volume of liquid consumed per day, and might also create a “generalized conditioned aversion to drinking water - detrimental to overall health and a confounding factor” as pointed out by reviewer 3. Therefore, we designed the experiment a little differently.

      In this two-bottle choice experiment, mice were first offered a high concentration of nicotine (100 µg/ml), which has previously been shown to induce avoidance behavior in mice (Figure 3C). Then, mice were offered three increasing concentrations of quinine: 30, 100 and 300 µM. Quinine avoidance was dose-dependent, as expected: it was moderate for 30 µM but almost absolute for 300 µM quinine. We then investigated whether nicotine and quinine avoidance were linked. We found no correlation between nicotine and quinine preference (new Figure 1-figure supplement 1D). This new experiment strongly suggests that aversion to the drug is not directly tied to the sensitivity of mice to the bitter taste of nicotine.
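      Per-mouse preference in a two-bottle choice test is usually expressed as the volume drunk from the test bottle divided by the total volume drunk, and the nicotine-quinine comparison then reduces to a correlation across mice. The sketch below computes both with a plain Pearson coefficient; the volumes are invented placeholders, and the statistic actually used for the new figure is not specified here.

```python
import math


def preference(test_ml, water_ml):
    """Percent of total intake drunk from the test (nicotine or quinine) bottle."""
    total = test_ml + water_ml
    return 100.0 * test_ml / total


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical daily volumes (ml) per mouse: (test bottle, water bottle).
nicotine = [(0.4, 3.6), (2.0, 2.0), (1.2, 2.8), (0.2, 3.8)]
quinine = [(1.0, 3.0), (0.9, 3.1), (1.1, 2.9), (1.0, 3.0)]
nic_pref = [preference(t, w) for t, w in nicotine]
qui_pref = [preference(t, w) for t, w in quinine]
r = pearson_r(nic_pref, qui_pref)
```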

      Other results reinforce this conclusion. First, none of the b4-/- mice (0/13) showed aversion to nicotine, whereas about half of the virally-rescued animals (8/17, b4 re-expressed in the IPN of b4-/- mice) showed nicotine aversion, a proportion similar to the one observed in WT mice. This experiment makes a clear, direct link between the expression of b4 nAChRs in the IPN and aversion to the drug.

      Furthermore, we also verified that the sensitivity of b4-/- mice to bitterness is not different from that of WT mice (new Figure 4 – figure supplement 1B). This new result indicates that b4-/- mice do not consume more nicotine than WT mice because of a reduced sensitivity to bitterness.

      Together, these new experiments strongly suggest that interindividual differences in sensitivity to the bitterness of nicotine play little role in nicotine consumption behavior in mice.

      b) Metabolic variabilities amongst isogenic mice have been observed. Thus, while the mice consume different amounts of nicotine, changes in metabolic processes, thus blood nicotine concentrations, could explain differences in nicotine consumption and neurophysiology across individuals. The authors should control if the blood concentration of nicotine metabolites between N-Av and Av are similar when consuming identical amounts of nicotine (50ug/ml), different amounts (200ug/ml), and in response to an acute injection of a fixed nicotine quantity.

      We agree with the reviewer that metabolic variabilities could explain (at least in part) the differences observed between avoiders and non-avoiders. But other factors could also play a role, such as stress level (there is a strong interaction between stress and nicotine addiction, as shown by our group (PMID: 29155800, PMID: 30361503) and others), hierarchical ranking, epigenetic factors etc… Our goal in this study is not to examine all possible sources of variability. What is striking about our results is that deletion of a single gene (encoding the nAChR b4 subunit) is sufficient to eliminate nicotine avoidance, and that re-expression of this receptor subunit in the IPN is sufficient to restore nicotine avoidance. In addition, we observe a strong correlation between the amplitude of nicotine-induced current in the IPN and nicotine consumption. Therefore, the expression level of b4 in the IPN is sufficient to explain most of the behavioral variability we observe. We do not feel the need to explore variations in metabolic activity, which would require very expensive experiments. However, we have added a sentence in the discussion to mention metabolic variability as a potential source of variability in nicotine consumption.

      2) Av mice exposed to nicotine_200ug/ml display minimal nicotine_50ug/ml consumption, yet would Av mice restore a percent nicotine consumption >20 when exposed to a more extended session at 50ug/kg? Such a data set will help identify and isolate learned avoidance processes from dose-dependent avoidance behaviors.

      We have now performed an additional two-bottle choice experiment to examine an extended period at 50 µg/ml, although with a slightly different design. We directly offered mice a high nicotine concentration (200 µg/ml), followed by 8 days at 50 µg/ml. We found that, overall, mice avoided the 200 µg/ml nicotine solution, and that the subsequent increase in nicotine preference was slow and gradual throughout the eight days at 50 µg/ml (Figure 2-figure supplement 1C). This slow adjustment to a lower dose contrasts with the rapid (within a day) change in intake observed when nicotine concentration increases (Figure 1-figure supplement 1A). About half of the mice (6/13) retained a steady, low nicotine preference (< 20%) throughout the eight days at 50 µg/ml, resembling what was observed for avoiders in Figure 2D. Together, these results suggest that some of the mice, the non-avoiders, rapidly adjust their intake to changes in nicotine concentration in the bottle. For avoiders, aversion to nicotine seems to involve a learning mechanism that, once triggered, results in prolonged cessation of nicotine consumption.

      3) The author should further investigate the basal properties of IPn neuron in vivo firing rate activity recorded and establish if their spontaneous activity determines their nicotine responses in vivo, such as firing rate, ISI, tonic, or phasic patterns. These analyses will provide helpful information to the neurophysiologist investigating the function of IPn neurons and will also inform how chronic nicotine exposure shapes the IPn neurophysiological properties.

      We have performed additional analyses of the in vivo recordings. First, we have built maps of the recorded neurons, and we show that there is no anatomical bias in our sampling between the different groups. The only condition for which we did not sample neurons similarly is when we compare the responses to nicotine in vivo in WT and b4-/- mice (Figure 4E). The two groups were not distributed similarly along the dorso-ventral axis (Figure 4-figure supplement 2B). Yet, we do not think that the difference in nicotine responses observed between WT and b4-/- mice is due to a sampling bias. Indeed, we found no link between the response to nicotine and the dorsoventral coordinates of the neurons in any of the groups (MPNic and MP Sal in Figure 3-figure supplement 1D; WT and b4-/- mice in Figure 4-figure supplement 2C). Therefore, our different groups are directly comparable, and the conclusions drawn in our study are fully justified.

      As requested, we have looked at whether the basal firing rate of IPN neurons determines the response to nicotine, and indeed, neurons with a higher firing rate show a greater change in firing frequency upon nicotine injection (Figure 3-figure supplement 1G and Figure 4-figure supplement 2F). We have also looked at the effect of chronic nicotine on the spontaneous firing rate of IPN neurons (Figure 3-figure supplement 1F) but found no evidence for a change in basal firing properties. Similarly, the deletion of b4 had no effect on the spontaneous activity of the recorded neurons (Figure 4-figure supplement 2F). Finally, we found no evidence for any link between the anatomical coordinates of the neurons and their basal firing rate (Figure 3-figure supplement 1E and Figure 4-figure supplement 2D).
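      The basal-firing analyses above rely on a few standard spike-train statistics. As a minimal sketch, assuming spike times in seconds and a hypothetical baseline window before the nicotine injection and a response window after it, the mean firing rate, the coefficient of variation (CV) of inter-spike intervals (one common index of regular versus irregular firing) and the percent change in rate can be computed as follows. Window lengths, spike trains and function names are illustrative, not those of the study.

```python
import statistics


def firing_rate(spike_times, t_start, t_stop):
    """Mean firing rate (Hz) of spikes falling in [t_start, t_stop)."""
    n = sum(1 for t in spike_times if t_start <= t < t_stop)
    return n / (t_stop - t_start)


def isi_cv(spike_times):
    """Coefficient of variation of inter-spike intervals (0 for perfectly regular firing)."""
    isis = [b - a for a, b in zip(spike_times, spike_times[1:])]
    return statistics.pstdev(isis) / statistics.mean(isis)


def percent_change(baseline_hz, response_hz):
    """Change in firing rate after injection, in percent of baseline."""
    return 100.0 * (response_hz - baseline_hz) / baseline_hz


# Hypothetical regular 5 Hz baseline followed by 8 Hz after an injection at t = 60 s.
baseline_spikes = [0.2 * k for k in range(300)]           # 0-59.8 s
response_spikes = [60.0 + 0.125 * k for k in range(480)]  # 60-119.875 s
spikes = baseline_spikes + response_spikes
base = firing_rate(spikes, 0.0, 60.0)
resp = firing_rate(spikes, 60.0, 120.0)
change = percent_change(base, resp)
```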

      Reviewer #3 (Public Review):

      The manuscript by Mondoloni et al characterizes two-bottle choice oral nicotine consumption and associated neurobiological phenotypes in the interpeduncular nucleus (IPN) using mice. The paper shows that mice exhibit differential oral nicotine consumption and correlates this difference with nicotine-evoked inward currents in neurons of the IPN. The beta4 nAChR subunit is likely involved in these responses. The paper suggests that prolonged exposure to nicotine results in reduced nAChR functional responses in IPN neurons. Many of these results or phenotypes are reversed or reduced in mice that are null for the beta4 subunit. These results are interesting and will add a contribution to the literature. However, there are several major concerns with the nicotine exposure model and a few other items that should be addressed.

      Strengths:

      Technical approaches are well-done. Oral nicotine, electrophysiology, and viral re-expression methods were strong and executed well. The scholarship is strong and the paper is generally well-written. The figures are high-quality.

      We would like to thank the reviewer for his/her comments and suggestions on how to improve the manuscript.

      Weaknesses:

      Two bottle choice (2BC) model. 2BC does not examine nicotine reinforcement, which is best shown as a volitional preference for the drug over the vehicle. Mice in this 2BC assay (and all such assays) only ever show indifference to nicotine at best - not preference. This is seen in the maximal 50% preference for the nicotine-containing bottle. 2BC assays using tastants such as saccharin are confounded. Taste responses can very likely differ from primary reinforcement and can be related to peripheral biology in the mouth/tongue rather than in the brain reward pathway.

      The two-bottle nicotine drinking test is a commonly used method to study addiction in mice (Matta, S. G. et al. 2006. Guidelines on nicotine dose selection for in vivo research. Psychopharmacology 190, 269–319). Like all methods, it has its limitations, but it also allows different aspects to be addressed than those covered by self-administration protocols. The two-bottle nicotine drinking test simply measures the animals' preference for a solution containing nicotine over a control solution without nicotine: the animals are free to choose nicotine or not, which makes it possible to evaluate sensitivity and avoidance thresholds. What we show in this paper is precisely that, despite interindividual differences in the way the drug is used (passively or actively), a significant proportion of the animals avoid the nicotine bottle at a certain concentration, suggesting that we are dealing with individual characteristics that are worth identifying in the context of addiction and vulnerability. We agree that the two-bottle choice test cannot provide as much information about the reinforcing effects of the drug as self-administration procedures. We are aware of the limitations of the method and were careful not to interpret our data in terms of reinforcement by the drug. For instance, mice that consume nicotine were called “non-avoiders” and not “consumers”. We added a few sentences at the beginning of the discussion to highlight these limitations.

      The reviewer states that the mice in this 2BC assay (and all such assays) “only ever show indifference to nicotine at best - not preference”. This is seen in the maximal 50% preference for the nicotine-containing bottle. While this is true on average, it isn’t when we look at individual profiles, as we did here. We clearly observed that some mice have a strong preference for nicotine and, conversely, that some mice actively avoid nicotine after a certain concentration is proposed in the bottle.

      Regarding tastants, we indeed used saccharine to hide the bitter taste of nicotine and prevent taste-related side bias. This is a classical (though not perfect) paradigm in the field of nicotine research (Matta, S. G. et al. 2006. Guidelines on nicotine dose selection for in vivo research. Psychopharmacology 190, 269–319). To evaluate whether different sensitivities to the bitterness of nicotine may explain the interindividual differences in nicotine consumption, we performed new experiments (as suggested by all three reviewers). In this two-bottle choice experiment, mice were first offered a high concentration of nicotine (100 µg/ml), which has previously been shown to induce avoidance behavior in mice (Figure 3C). Then, mice were offered three increasing concentrations of quinine: 30, 100 and 300 µM. Quinine avoidance was dose-dependent, as expected: it was moderate for 30 µM but almost absolute for 300 µM quinine. We then investigated whether nicotine and quinine avoidance were linked. We found no correlation between nicotine and quinine preference (new Figure 1-figure supplement 1D). This new experiment strongly suggests that aversion to the drug is not directly tied to the sensitivity of mice to the bitter taste of nicotine. Other results reinforce this conclusion. First, none of the b4-/- mice (0/13) showed aversion to nicotine, whereas about half of the virally-rescued animals (8/17, b4 re-expressed in the IPN of b4-/- mice) showed nicotine aversion, a proportion similar to the one observed in WT mice. This experiment makes a clear, direct link between the expression of b4 nAChRs in the IPN and aversion to the drug. Furthermore, we also verified that the sensitivity of b4-/- mice to bitterness is not different from that of WT mice (new Figure 4 - figure supplement 1B). This new result indicates that b4-/- mice do not consume more nicotine than WT mice because of a reduced sensitivity to bitterness.

      Together, these new experiments strongly suggest that interindividual differences in sensitivity to the bitterness of nicotine play little role in nicotine consumption behavior in mice.

      Moreover, this assay does not test free choice, as nicotine is mixed with water which the mice require to survive. Since most concentrations of nicotine are aversive, this may create a generalized conditioned aversion to drinking water - detrimental to overall health and a confounding factor.

      Mice are given a choice between two bottles, only one of which contains nicotine. Hence, even though their choices are not fully free (they are presented with a limited set of options), mice can always avoid nicotine by drinking from the bottle containing water only. We do not see how this situation could create a generalized aversion to drinking. In fact, we have never observed any mouse losing weight or showing deteriorated health in this test, so we do not think it is a confounding factor.

      What plasma concentrations of nicotine are achieved by 2BC? When nicotine is truly reinforcing, rodents and humans titrate their plasma concentrations up to 30-50 ng/mL. The Discussion states that oral self-administration in mice mimics administration in human smokers (lines 388-389). This is unjustified and should be removed. Similarly, the paragraph in lines 409-423 is quite speculative and difficult or impossible to test. This paragraph should be removed or substantially changed to avoid speculation. Overall, the 2BC model has substantial weaknesses, and/or it is limited in the conclusions it will support.

      The reviewer must have read another version of our article, because these sentences and paragraphs are not present in our manuscript.

      Regarding the actual concentration of nicotine in the plasma, this is indeed a good question. We have measured plasma concentrations of nicotine for another study (article in preparation). The results from this experiment can be found below. The half-life of nicotine is very short in the blood and brain of mice (about 6 min, see Matta, S. G. et al. 2006. Guidelines on nicotine dose selection for in vivo research. Psychopharmacology 190, 269–319), making it very hard to assess. Therefore, we also assessed the plasma concentration of cotinine, the main metabolite of nicotine. We compared 4 different conditions: home cage (forced drinking of a 100 µg/ml nicotine solution); osmotic minipump (OP, 10 mg/kg/d, as in our current study); Souris City (a large social environment developed by our group, see Torquet et al. Nat. Commun. 2018); and the two-bottle choice procedure (when a 100 µg/ml nicotine solution was offered). Plasma nicotine concentrations were very low in all groups that drank nicotine, but not in the group that received nicotine through the osmotic minipump. This is most likely because mice did not drink any nicotine in the hour prior to sampling, so that all nicotine had been metabolized. Indeed, cotinine was present in the plasma of all groups. The plasma concentration of cotinine was similar in the two groups for which “consumption” was forced: forced drinking in the home cage (HC) or infusion through an osmotic minipump. This indicates that the plasma concentration of cotinine is similar whether mice drink nicotine (100 µg/ml) or receive it through the minipump (10 mg/kg/d). For Souris City and the two-bottle choice procedure, cotinine concentrations were in the same range (mostly between 0 and 100 ng/ml).
Overall, the concentrations of nicotine and cotinine found in the plasma of mice that underwent the two-bottle choice procedure are in the range of those previously described (Matta, S. G. et al. 2006. Guidelines on nicotine dose selection for in vivo research. Psychopharmacology 190, 269–319).

      Regarding the limitations of the two-bottle choice test, we discuss them more extensively in the current version of the manuscript.

      Statistical testing on subgroups. Mice are run through an assay and assigned to subgroups based on being classified as avoiders or non-avoiders. The authors then perform statistical testing to show differences between the avoiders and non-avoiders. It is circular to do so. When the authors divided the mice into avoiders and non-avoiders, this implies that the mice are different or from different distributions in terms of nicotine intake. Conducting a statistical test within the null hypothesis framework, however, implies that the null hypothesis is being tested. The null hypothesis, by definition, is that the groups do NOT differ. Obviously, the authors will find a difference between the groups in a statistical test when they pre-sorted the mice into two groups, to begin with. Comparing effect sizes or some other comparison that does not invoke the null hypothesis would be appropriate.

      Our analysis, which can be summarized as follows, is fairly standard (see Krishnan, V. et al. (2007) Molecular adaptations underlying susceptibility and resistance to social defeat in brain reward regions. Cell 131, 391–404). First, the mice are segregated into two groups based on their consumption profile, using the variability in their behavior. The two groups are obviously statistically different when comparing their consumption. This first analytical step allows us to highlight the variability and to establish the properties of each subpopulation in terms of consumption. The reviewer's criticism would apply if our analysis ended at this point. However, our analysis does not end here and moves on to a second step. The separation of the mice into two groups (now a categorical variable) is used to compare the distribution of other variables, such as mouse choice strategy and current amplitude, between the 2 categories. The null hypothesis tested is that the values of these other variables do not differ between groups. There is no obvious a priori reason for the currents recorded in the IPN to differ between the two groups. These approaches allow us to reveal correlations between the variables. Finally, in the third and last step, one (or several) variable(s) are manipulated to check whether nicotine consumption is modified accordingly. Manipulation was performed by exposing mice to chronic nicotine, by using mutant mice with decreased nicotinic currents, and by re-expressing the deleted nAChR subunit only in the IPN. This procedure is fairly standard and cannot be considered a circular analysis with a data selection problem, as explained in (Kriegeskorte, N., Simmons, W. K., Bellgowan, P. S. F. & Baker, C. I. (2009) Circular analysis in systems neuroscience: the dangers of double dipping. Nature Neuroscience 12, 535-540).
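      To make the second step concrete, here is a minimal Python sketch under hypothetical assumptions: the preference threshold (0.5), the variable names and all values are illustrative rather than the study's data, and a two-sided permutation test on the difference in means stands in for whichever test the manuscript actually applies. What it shows is that the group labels come from consumption alone, while the test is run on a different variable (here a hypothetical IPN current amplitude).

```python
import random
from statistics import mean

def classify(preferences, threshold=0.5):
    """Step 1: label each mouse from its consumption profile only.
    The 0.5 cut-off is a hypothetical value for illustration."""
    return ["avoider" if p < threshold else "non-avoider" for p in preferences]

def permutation_test(a, b, n_perm=5000, seed=0):
    """Step 2: two-sided permutation test on the difference in means of a
    variable that was NOT used to form the groups."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero
    return (hits + 1) / (n_perm + 1)

# Hypothetical data: nicotine preference (fraction) and IPN current (pA)
preference = [0.10, 0.20, 0.15, 0.30, 0.60, 0.70, 0.80, 0.65]
current_pa = [420, 380, 450, 400, 150, 120, 180, 160]

labels = classify(preference)
avoiders = [c for c, lab in zip(current_pa, labels) if lab == "avoider"]
others = [c for c, lab in zip(current_pa, labels) if lab == "non-avoider"]
p_value = permutation_test(avoiders, others)
```

      Because the labels are fixed before the current amplitudes are examined, step 1 does not by itself force a small p-value in step 2: if the currents were unrelated to consumption, shuffling them would often produce differences as large as the observed one.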

      Decreased nicotine-evoked currents following passive exposure to nicotine in minipumps are inconsistent with published results showing that similar nicotine exposure enhances nAChR function via several measures (Arvin et al, J Neurosci, 2019). The paper does acknowledge this previous paper and suggests that the discrepancy is explained by the fact that they used a higher concentration of nicotine (30 uM) that was able to recruit the beta4-containing receptor (whereas Arvin et al used a caged nicotine that was unable to do so). This may be true, but the citation of 30 uM nicotine undercuts the argument a bit because 30 uM nicotine is unlikely to be achieved in the brain of a person using tobacco products; nicotine levels in smokers are 100-500 nM. It should be noted in the paper that it is unclear whether the down-regulated receptors would be active at concentrations of nicotine found in the brain of a smoker.

      We indeed find opposite results compared to Arvin et al., and we give possible explanations for this discrepancy in the discussion. To be honest we don’t fully understand why we have opposite results. However, we clearly observed a decreased response to nicotine, both in vitro (with 30 µM nicotine on brain slices) and in vivo (with a classical dose of 30 µg/kg nicotine i.v.), while Arvin et al. only tested nicotine in vitro.

      Regarding the reviewer’s comment about the nicotine concentration used (30 µM): we used that concentration in vitro to measure nicotine-induced currents (it is close to the EC50 for heteromeric receptors and will likely recruit low-affinity a3b4 receptors) and to evaluate the changes in nAChR currents following nicotine exposure. We did not use that concentration to induce nAChR desensitization, so we do not fully understand the argument regarding the levels of nicotine in smokers. To induce desensitization, we used a minipump delivering 10 mg/kg/day, which is the amount of nicotine mice drink in our assay.

      The statement in lines 440-41 ("we show that concentrations of nicotine as low as 7.5 ug/kg can engage the IPN circuitry") is misleading, as the concentration in the water is not the same as the concentration in the CSF since the latter would be expected to build up over time. The paper did not provide measurements of nicotine in plasma or CSF, so concluding that the water concentration of nicotine is related to plasma concentrations of nicotine is only speculative.

      The sentence “we show that concentrations of nicotine as low as 7.5 ug/kg can engage the IPN circuitry" is not in the manuscript so the reviewer must have read another version of the paper.

      The results in Figure 2E do not appear to be from a normal distribution. For example, results cluster at low (~100 pA) responses, and a fraction of larger responses drive the similarities or differences.

      Indeed, that is why we performed a non-parametric Mann-Whitney test for comparing the two groups, as indicated in the legend of figure 2E.
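      For readers unfamiliar with the test, a minimal pure-Python sketch of the Mann-Whitney U statistic follows. It compares ranks rather than raw values, which is why a cluster of small responses plus a few large ones does not violate its assumptions. It uses the large-sample normal approximation without tie correction, so it is only a rough stand-in for the exact or tie-corrected versions in standard statistics packages; the example currents are hypothetical.

```python
import math

def mann_whitney_u(x, y):
    """Return (U, two-sided p) using the normal approximation.
    Tied values receive average ranks; the variance is not tie-corrected,
    so this is a sketch for moderately large samples."""
    # Sort pooled values, remembering each value's original position
    combined = sorted((v, i) for i, v in enumerate(list(x) + list(y)))
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[combined[k][1]] = avg_rank
        i = j + 1
    n1, n2 = len(x), len(y)
    r1 = sum(ranks[:n1])               # rank sum of the first sample
    u1 = r1 - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return u, p

# Hypothetical currents (pA): most near 100 pA, a few large responses
group_a = [90, 95, 100, 105, 110, 400, 520]
group_b = [60, 70, 80, 85, 95, 120, 150]
print(mann_whitney_u(group_a, group_b))
```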

      10 mg/kg/day in mice or rats is likely a non-physiological exposure to nicotine. Most rats take in 1.0 to 1.5 mg/kg over a 23-hour self-administration period (O'Dell, 2007). Mice achieve similar levels during SA (Fowler, Neuropharmacology 2011). Forced exposure to 10 mg/kg/day is therefore 5 to 10-fold higher than rodents would ever expose themselves to if given the choice. This should be acknowledged in a limitations section of the Discussion.

      The two-bottle choice task is very different from nicotine self-administration procedures in terms of administration route: oral versus injected (into the blood or the brain), respectively. Therefore, the quantities of drug consumed cannot be directly compared. In our manuscript, mice consume on average 10 mg/kg/day of nicotine at the highest nicotine concentration tested, which is fully consistent with many published studies (20 mg/kg/day in Frahm et al. Neuron 2011, 5-10 mg/kg/day in Bagdas et al., NP 2020, 10-20 mg/kg/day in Bagdas et al. NP 2019, to cite a few). Hence, we used that dose of nicotine (10 mg/kg/d) for chronic administration using minipumps. This dose is also classically used in osmotic minipumps for chronic administration of nicotine: 10 mg/kg/d in Dongelmans et al. Nat. Commun. 2021 (our lab), 12 mg/kg/d in Arvin et al. J. Neurosci. 2019 (Drenan lab), 12 mg/kg/d in Lotfipour et al. J. Neurosci. 2013 (Boulter lab), etc. Therefore, we do not see the issue here.
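      As a worked illustration of how a drinking-solution intake translates into an oral dose in mg/kg/day, here is a small Python sketch. The body weight (30 g) and the daily volume drunk from the nicotine bottle (3 ml) are hypothetical values chosen for the arithmetic, not measurements from our study; only the 100 µg/ml concentration appears in the text.

```python
def daily_dose_mg_per_kg(conc_ug_per_ml, intake_ml_per_day, body_weight_g):
    """Oral nicotine dose in mg/kg/day from drinking-solution intake.

    conc_ug_per_ml: nicotine concentration in the bottle (µg/ml)
    intake_ml_per_day: volume drunk from the nicotine bottle (ml/day)
    body_weight_g: body weight (g)
    Uses the identity 1 µg/g == 1 mg/kg, so no extra scaling is needed.
    """
    if body_weight_g <= 0:
        raise ValueError("body weight must be positive")
    ug_per_day = conc_ug_per_ml * intake_ml_per_day
    return ug_per_day / body_weight_g

# Hypothetical mouse: 30 g, drinking 3 ml/day of a 100 µg/ml solution
dose = daily_dose_mg_per_kg(100, 3, 30)  # 300 µg / 30 g = 10 mg/kg/day
```

      Because this is an ingested amount, part of which is metabolized before reaching the circulation, the figure is not directly comparable to the intravenous amounts reported in self-administration studies.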

      Are the in vivo recordings in IPN enriched or specific for cells that have a spontaneous firing at rest? If so, this may or may not be the same set/type of cells that are recorded in patch experiments. The results could be biased toward a subset of neurons with spontaneous firing. There are MANY different types of neurons in IPN that are largely intermingled (see Ables et al, 2017 PNAS), so this is a potential problem.

      It is true that there are many types of neurons in the IPN. In vivo electrophysiology and slice electrophysiology should be considered two complementary methods to obtain detailed properties of IPN neurons. The populations sampled by these two methods are certainly not identical (IPR in patch-clamp versus mostly IPR and IPC in vivo), and indeed only spontaneously active neurons are recorded in vivo. The question is whether or not this is a problem. The results we obtained using in vivo and brain-slice electrophysiology are consistent (i.e., a decreased response to nicotine), which indicates that our results are robust and do not depend on the selection of a particular subpopulation. In addition, we now provide maps of the neurons recorded both in slices and in vivo (see supplementary figures, and response to the other two referees). We show that, overall, there is no sampling bias between the different groups. Together, these new analyses strongly suggest that the differences we observe between groups are not due to sampling issues. We have added the Ables 2017 reference and discuss neuron variability more extensively in the revised manuscript.

      Related to the above issue, in which of the many different IPN neuron types did the group re-express beta4? Could that be controlled or did beta4 get re-expressed in an unknown set of neurons in IPN? There is insufficient information given in the methods for verification of stereotaxic injections.

      Re-expression of b4 was achieved with a strong, ubiquitous promoter (pGK); hence all cell types should in principle be transduced. This is now clearly stated in the Results section, the figure legend and the Methods section. Unfortunately, we had no access to a specific mouse line to restrict expression of b4 to b4-expressing cells, since the b4-Cre line of GENSAT is no longer available. This mouse line was problematic anyway, because expression levels of the a3, a5 and b4 nAChR subunits, which belong to the same gene cluster, were reported to be affected. Yet, we show in this article that deleting b4 leads to a strong reduction of nicotine-induced currents in the IPR (80%, patch-clamp) and of the response to nicotine in vivo (65%). These results indicate that b4 is strongly expressed in the IPN, likely in a large majority of IPR and IPC neurons (see also our response to reviewer 1). In addition, we show that our re-expression strategy restores nicotine-induced currents in patch-clamp experiments and also the response to nicotine in vivo (new Figure 5C). Non-native expression levels (e.g. overexpression) could potentially be achieved, but this is not what we observed: responses to nicotine were restored to WT levels (in slices and in vivo). Importantly, this strategy also rescued the WT phenotype in terms of nicotine consumption. Expression of b4 alone in cells that do not express any other nAChR subunit (as, presumably, in the lateral parts of the IPN, see GENSAT images above) should not produce any functional nAChR, since alpha subunits are mandatory to produce functional receptors. As specified in the manuscript, proper transduction of the IPN was verified using post-hoc immunohistochemistry, and mice with transduction of b4 in the VTA were excluded from the analyses.

      Data showing that alpha3 or beta4 disruption alters MHb/IPN nAChR function and nicotine 2BC intake is not novel. In fact, some of the same authors were involved in a paper in 2011 (Frahm et al., Neuron) showing that enhanced alpha3beta4 nAChR function was associated with reduced nicotine consumption. The present paper would therefore seem to somewhat contradict prior findings from members of the research group.

      Frahm et al used a transgenic mouse line (called TABAC) in which the expression of a3b4 receptor is increased, and they observed reduced nicotine consumption. We do the exact opposite: we reduce (a3)b4 receptor expression (using the b4 knock-out line, or by putting mice under chronic nicotine), and observe increased consumption. There is thus no contradiction. In fact, we discuss our findings in the light of Frahm et al. in the discussion section.

      Sex differences. All studies were conducted in male mice, therefore nothing was reported regarding female nicotine intake or physiology responses. Nicotine-related biology often shows sex differences, and there should be a justification provided regarding the lack of data in females. A limitations section in the Discussion section is a good place for this.

      We agree with the reviewer. We added a sentence in the discussion.

    1. Author Response

      Reviewer #2 (Public Review):

      We are in a golden age for comparative genomics and this is a prime example of the utility of the field. "Vision-related convergent gene losses reveal SERPINE3's unknown role in the eye" details the discovery of a function for a previously uncharacterized gene in regulating organ development in evolution. The authors intersect patterns of gene loss, quantified as the percentage of intact coding sequence, with visual acuity scores across Mammalia. This analysis identified 26 significant genes that have undergone convergent loss with phenotypic decreases of vision. Many of those genes have previously been annotated in the eye, indicating the analysis was successful and suggesting the uncharacterized genes may also have roles there.

      The authors ruled out the top hit due to its specific expression in the testis, and instead performed an in-depth characterization of the second hit, SERPINE3. This included an impressive breadth of comparative genomics across 430 placental mammals, carefully describing the many and diverse genetic perturbations of SERPINE3 in lineages with low visual acuity. These results are persuasive that SERPINE3 is involved in vision, and it is a great example and description of gene loss in adaptation.

      Critically, the authors validated the role of SERPINE3 in eye structure by confirming expression patterns in the eye and characterizing its knockout in zebrafish, demonstrating both qualitative and quantitative impairments to eye structure. This is particularly satisfying, as many comparative genomics studies make such associations but never validate the result. Here, validation of SERPINE3 was an undeniable success and puts a functional annotation to a previously uncharacterized gene. The utility of comparative genomics and zebrafish genetic models has been expertly capitalized upon and there is no doubt our knowledge of eye genetics has increased.

      We thank the reviewer for these kind words and the valuable comments, which we address below.

      While these end results are certainly valuable to the community, details regarding the statistics and filters underlying the initial convergence analysis are too sparse to interpret. The impressive false discovery rate of the top hits is called into question when the top hit (corrected p-value < 1.1E-15 with visual acuity < 2) is subsequently skipped due to its specific expression in the testes. Given this disconnect, and without knowing the rationale and consequences of the various filters, it is difficult to get a sense of the accuracy and robustness of these p-values. Plots of p-value distributions across the dataset would demonstrate the method is statistically sound and would provide the backdrop to interpret the top hits of interest.

      We have now simplified the workflow to detect convergent gene losses in species with lower visual acuity values and explained the rationale of each step (this is detailed in the responses below). We would like to mention that our screen may find genes that are associated with other phenotypes shared between species exhibiting lower visual acuity values. For example, several of these species are subterranean mammals, which share other traits and adaptations to their environment. While we do not know which trait the loss of the testis gene TSACC is associated with, its FDR is only slightly lower than the FDR of the second-ranked SERPINE3 (FDR 1.1E-6 vs. 1.5E-6).

      As suggested, we plotted the distribution of the raw P-values of all 13172 genes for which we ran the phylogenetic least square approach. This distribution has a peak at low P-values, indicating that some genes are preferentially lost in the poor-vision mammals. The distribution also shows peaks at ~0.5 and ~1. We investigated which patterns of %intact reading frame values contribute to these two peaks.

      Many genes with P-values of ~0.5 have one high-acuity species (blue) in which the %intact value is slightly reduced, whereas all other high- and poor-acuity (red) species have a 100% intact reading frame. Two examples, in which rhesus or dolphin has a lower %intact value, are shown below:

      Similarly, many genes with P-values of ~1 have two or more high-acuity species, where the %intact value is reduced, whereas all other species have a 100% intact reading frame.

      Since these genes have lower %intact values in a few high-acuity species, the high P-values likely capture a negative association with our trait of interest. While it is not clear why many P-values are around 0.5 or 1, it is clear that these genes are not associated with poor vision.
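      A P-value histogram of this kind is simple to produce; the Python sketch below (standard library only) bins P-values into equal-width intervals and prints a text histogram. The bin count and the example values are hypothetical. For a continuous test, P-values are roughly uniform under the null hypothesis, so the bars should be roughly flat; excess mass near 0 suggests true associations, while spikes elsewhere (such as the ~0.5 and ~1 peaks discussed above) point to specific data patterns worth inspecting.

```python
def pvalue_histogram(pvalues, n_bins=20):
    """Count P-values in n_bins equal-width bins over [0, 1]."""
    counts = [0] * n_bins
    for p in pvalues:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a P-value: {p}")
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        counts[idx] += 1
    return counts

def text_plot(counts, width=40):
    """Render bin counts as one '#'-bar line per bin, scaled to `width`."""
    peak = max(counts) or 1
    n = len(counts)
    lines = []
    for i, c in enumerate(counts):
        bar = "#" * round(width * c / peak)
        lines.append(f"{i / n:4.2f}-{(i + 1) / n:4.2f} | {bar} {c}")
    return "\n".join(lines)

# Hypothetical P-values with a low peak and spikes near 0.5 and 1
hypothetical_p = [0.001, 0.003, 0.02, 0.2, 0.45, 0.5, 0.5, 0.51, 0.8, 0.99, 1.0, 1.0]
print(text_plot(pvalue_histogram(hypothetical_p, n_bins=10)))
```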

      Our main purpose of using the phylogenetic least square approach was to rank the genes by their association with the poor vision phenotype. Importantly, the top-ranked candidates are all preferentially lost in low-acuity mammals, which is evident from Figure 1A. Furthermore, for SERPINE3, where we experimentally confirmed an eye-related function, three different screens with different phenotype definitions robustly support a preferential loss in low-acuity species (detailed below).

      Notes on how many genes pass each filter, and what kinds of genes, would allow interpretation of possible bias in those filters and how they interact with the convergence analysis.

      We thank the reviewer for this suggestion. As detailed below, we have now simplified the filtering procedure, justified the filter steps in the revised methods section, and added a flowchart (Figure 1 - supplement figure 1) describing each step and how many genes passed each filter (below).

      For instance, the slight changes in visual acuity cutoffs have non-obvious operational consequences for vision, yet large impacts on the resulting gene lists, making it difficult to interpret how the measure is functioning. Most importantly, a negative control in the convergence analysis, demonstrating a null p-value distribution with the same filters, would assuage most concerns.

      The reviewer is correct that changes in the visual acuity cutoff lead to different gene lists, because the screen searches for genes preferentially lost in different species. However, our screens using three visual acuity cutoffs consistently find SERPINE3 among the top 8 genes (Figure 1 - source data 5,6), showing that the association with lower visual acuity is robust for this gene.

      As suggested, we have now run a negative control screen. For the negative control, we considered close relatives of the low-acuity species as trait-loss species. Specifically, we selected elephant, rhinoceros, horse, the two flying foxes, guinea pig, degu and squirrel. These 8 species represent five independent lineages. All other species (including the low-acuity species) were treated as trait-preserving species. A Forward Genomics screen with otherwise identical filter parameters retrieved only two hits, TUBAL3 and TRIM52, which have no known function in the eye. This supports the specificity of our screen.

      We added this to the main text:

      “To confirm the specificity of these results, we performed a control screen for genes that are preferentially lost in high-acuity sister species of the low-acuity mammals. This control screen retrieved only two genes, none of which have known functions in the eye (Figure 1 - source data 4). Together, this shows that our genome-wide screen for genes preferentially lost in low-acuity species successfully retrieved known vision-related genes.”

      and Methods:

      “As a control to ensure that a Forward Genomics screen does not always retrieve vision-related genes, we ran a new screen, searching for genes preferentially lost in high-acuity sister species (elephant, rhinoceros, horse, two flying foxes, guinea pig, degu, squirrel) of the low-acuity mammals that we used in the original screen. All other species including the other high-acuity mammals were then treated as background (Figure 1 - source data 4).”

    1. Author Response

      Reviewer #1 (Public Review):

      1) Although I found the introduction well written, I think it lacks some information or needs to develop more on some ideas (e.g., differences between the cerebellum and cerebral cortex, and folding patterns of both structures). For example, after stating that "Many aspects of the organization of the cerebellum and cerebrum are, however, very different" (1st paragraph), I think the authors need to develop more on what these differences are. Perhaps just rearranging some of the text/paragraphs will help make it better for a broad audience (e.g., authors could move the next paragraph up, i.e., "While the cx is unique to mammals (...)").

      We have added additional context to the introduction and developed the differences between cerebral and cerebellar cortex, also re-arranging the text as suggested.

      2) Given that the authors compare the folding patterns between the cerebrum and cerebellum, another point that could be mentioned in the introduction is the fact that the cerebellum is convoluted in every mammalian species (and non-mammalian spp as well) while the cerebrum tends to be convoluted in species with larger brains. Why is that so? Do we know about it (check Van Essen et al., 2018)? I think this is an important point to raise in the introduction and to bring it back into the discussion with the results.

      We now mention in the introduction that the cerebellum is folded in mammals, birds and some fishes, and provide references to the relevant literature. We have also expanded the discussion of the reasons for cortical folding, which now contains a dedicated subsection on the subject (including references to the work of Van Essen).

      3) In the results, first paragraph, what do the authors mean by the volume of the medial cerebellum? This needs clarification.

      We have modified the relevant section in the results and clarified the definition of the medial cerebellum, indicating that we refer to the vermal region of the cerebellum.

      4) In the results: When the authors mention 'frequency of cerebellar folding', do they mean the degree of folding in the cerebellum? At least in non-mammalian species, many studies have tried to compare the 'degree or frequency of folding' in the cerebellum by different proxies/measurements (see Iwaniuk et al., 2006; Yopak et al., 2007; Lisney et al., 2007; Yopak et al., 2016; Cunha et al., 2022). Perhaps change the phrase in the second paragraph of the result to: "There are no comparative analyses of the frequency of cerebellar folding in mammals, to our knowledge".

      We have modified the subsection in the methods referring to the measurement of folial width and folial perimeter to make the difference clearer. The folding indices that have been used previously (which we cite) are based on Zilles’s gyrification index. This index provides only a global idea of the degree of folding, and it is unable to distinguish a cortex with profuse shallow folds from one with a few deep ones. An example of this is now illustrated in Fig. 3d, where we also show how that problem is solved by the use of our two measurements (folial width and perimeter). The problem is also discussed in the section about the measurement of folding in the discussion section:

      “Previous studies of cerebellar folding have relied either on a qualitative visual score (Yopak et al. 2007, Lisney et al. 2008) or a “gyrification index” based on the method introduced by Zilles et al. (1988, 1989) for the study of cerebral folding (Iwaniuk et al. 2006, Cunha et al. 2020, 2021). Zilles’s gyrification index is the ratio between the length of the outer contour of the cortex and the length of an idealised envelope meant to reflect the length of the cortex if it were not folded. For instance, a completely lissencephalic cortex would have a gyrification index close to 1, while a human cerebral cortex typically has a gyrification index of ~2.5 (Zilles et al. 1988). This method has certain limitations, as highlighted by various researchers (Germanaud et al. 2012, 2014, Rabiei et al. 2018, Schaer et al. 2008, Toro et al. 2008, Heuer et al. 2019). One important drawback is that the gyrification index produces the same value for contours with wide variations in folding frequency and amplitude, as illustrated in Fig. 3d. In reality, folding frequency (inverse of folding wavelength) and folding amplitude represent two distinct dimensions of folding that cannot be adequately captured by a single number confusing both dimensions. To address this issue we introduced 2 measurements of folding: folial width and folial perimeter. These measurements can be directly linked to folding frequency and amplitude, and are comparable to the folding depth and folding wavelength we introduced previously for cerebral 3D meshes (Heuer et al. 2019). By using these measurements, we can differentiate folding patterns that could be confused when using a single value such as the gyrification index (Fig. 3d). Additionally, these two dimensions of folding are important, because they can be related to the predictions made by biomechanical models of cortical folding, as we will discuss now.”
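      The limitation described in this passage can be checked numerically. The Python sketch below is a simplification under stated assumptions: each cortical contour is idealized as a sinusoid y = A·sin(2πfx) over a unit span, and the envelope is taken as a flat line of the same span, so the index reduces to contour length divided by span. Over whole periods, the arc length of such a sinusoid depends only on the product A·f, so a contour with few deep folds and one with many shallow folds yield the same index. This illustrates the principle only; it is not how folial width or perimeter are measured.

```python
import math

def contour_length(amplitude, frequency, span=1.0, n=50_000):
    """Length of y = A*sin(2*pi*f*x) for x in [0, span], as a sum of chords."""
    total = 0.0
    x0, y0 = 0.0, 0.0
    for i in range(1, n + 1):
        x = span * i / n
        y = amplitude * math.sin(2 * math.pi * frequency * x)
        total += math.hypot(x - x0, y - y0)
        x0, y0 = x, y
    return total

def gyrification_index(amplitude, frequency, span=1.0):
    """Contour length over envelope length. The envelope is assumed to be a
    flat line of length `span` (the length of the unfolded layer)."""
    return contour_length(amplitude, frequency, span) / span

# Few deep folds vs. many shallow folds, with the same product A*f = 0.2
deep_sparse = gyrification_index(0.10, 2)
shallow_dense = gyrification_index(0.05, 4)
print(deep_sparse, shallow_dense)  # the two values agree to numerical precision
```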

      5) Sultan and Braitenberg (1993) measured cerebella that were sagittally sectioned (instead of coronal), right? Do you think this difference in the plane of the section could be one of the reasons explaining different results on folial width between studies? Why does the foliation index calculated by Sultan and Braitenberg (1993) not provide information about folding frequency?

      The measurement of foliation should be similar as long as enough folds are sectioned perpendicular to their main axis. This will be the case for folds in the medial cerebellum (vermis) sectioned sagittally, and for folds in the lateral cerebellum sectioned coronally. The foliation index of Sultan and Braitenberg does not provide the same account of folding frequency as ours because they only measure groups of folia (what some have called lamellae), whereas we measure individual folia. From their paper alone, it is not easy to understand exactly how Sultan and Braitenberg proceeded, so we contacted Prof. Fahad Sultan (we acknowledge his help in our manuscript). Author response image 1 provides a clearer description of their procedure:

      Author response image 1.

      As Author response image 1 shows, each of the structures that they call a fold is composed of several folia, and so their measurements are not comparable with ours which measure individual folia (a). The flattened representation (b) is made by stacking the lengths of the fold axes (dashed lines), separating them by the total length of each fold (the solid lines), which each may contain several folia.

      6) Another point that needs to be clarified is the log transformation of the data. Did the authors use log-transformed data for all types of analyses done in the study? Write this information in the material and methods.

      Yes, we used the log10 transformation for all our measurements. This is now mentioned in the methods section, and again in the section concerning allometry. We are including a link to all our code to facilitate exact replication of our entire method, including this transformation.
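      As an illustration of what the log10 transformation implies for allometry, the Python sketch below fits a power law y = a·x^b by ordinary least squares on log10-transformed values; the slope b is the allometric exponent, and b < 1 means that y grows more slowly than x. This is a hypothetical, minimal example with made-up values: it ignores phylogenetic non-independence and is not meant to reproduce the analyses in our shared code.

```python
import math

def allometric_fit(x, y):
    """Ordinary least-squares fit of log10(y) = log10(a) + b*log10(x).
    Returns (b, a): the allometric exponent and the scaling constant."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    if any(v <= 0 for v in list(x) + list(y)):
        raise ValueError("log10 requires strictly positive values")
    lx = [math.log10(v) for v in x]
    ly = [math.log10(v) for v in y]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (w - my) for u, w in zip(lx, ly))
    b = sxy / sxx
    a = 10 ** (my - b * mx)
    return b, a

# Hypothetical data following y = 2 * x**0.5 exactly
sizes = [1, 10, 100, 1000]
widths = [2 * v ** 0.5 for v in sizes]
exponent, constant = allometric_fit(sizes, widths)
```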

      7) The discussion needs to be expanded. The focus of the paper is on the folding pattern of the cerebellum (among different mammalian species) and its relationship with the anatomy of the cerebrum. Therefore, the discussion on this topic needs to be better developed, in my opinion (especially given the interesting results of this paper). For example, with the findings of this study, what can we say about how the folding of the cerebellum is determined across mammals? The authors found that the folial width, folial perimeter, and thickness of the molecular layer increase at a relatively slow rate across the species studied. Does this mean that these parameters have little influence on the cerebellar folding pattern? What mostly defines the folding patterns of the cerebellum given the results? Is it the interaction between section length and area? Can the authors explain why size does not seem to be a "limiting factor" for the folding of the cerebellum (for example, even relatively small cerebella are folded)? Is that because the 'white matter' core of the cerebellum is relatively small (thus more stress on it)?

      We have expanded the discussion as suggested, with subsections detailing the measurement of folding, the modelling of folding for the cerebrum and the cerebellum, and the role that cerebellar folding may play in its function. We refer to the literature on cortical folding modelling, and we discuss our results in terms of the factors that this research has highlighted as critical for folding. From the discussion subsection on models of cortical folding:

      “The folding of the cerebral cortex has been the focus of intense research, both from the perspective of neurobiology (Borrell 2018, Fernández and Borrell 2023) and physics (Toro and Burnod 2005, Tallinen et al. 2014, Kroenke and Bayly 2018). Current biomechanical models suggest that cortical folding should result from a buckling instability triggered by the growth of the cortical grey matter on top of the white matter core. In such systems, the growing layer should first expand without folding, increasing the stress in the core. But this configuration is unstable, and, if growth continues, stress is released through cortical folding. The wavelength of folding depends on cortical thickness, and folding models such as the one by Tallinen et al. (2014) predict a neocortical folding wavelength which corresponds well with the one observed in real cortices. Tallinen et al. (2014) provided a prediction for the relationship between folding wavelength λ and the mean thickness (𝑡) of the cortical layer: λ = 2π𝑡(µ/(3µₛ))^(1/3). (...)”
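
      The Tallinen et al. (2014) relation quoted above can be evaluated directly to see how thickness drives the predicted wavelength. This is only an illustrative sketch: it assumes, as in that model, that µ is the shear modulus of the growing cortical layer and µₛ that of the underlying core (the full definitions are elided from the quote), and the numbers in the usage lines are hypothetical, not measurements from this study. Because λ is proportional to 𝑡 for a fixed modulus ratio, halving cortical thickness halves the predicted wavelength, which is the intuition behind the answers below about thin cortices folding even in small structures.

```python
import math


def folding_wavelength(t: float, mu: float, mu_s: float) -> float:
    """Predicted folding wavelength, lambda = 2*pi*t*(mu / (3*mu_s))**(1/3).

    t    -- mean thickness of the growing cortical layer (any length unit)
    mu   -- shear modulus of the cortical layer (assumed meaning)
    mu_s -- shear modulus of the underlying core (assumed meaning)
    The result has the same unit as t; only the ratio mu / mu_s matters.
    """
    if t <= 0 or mu <= 0 or mu_s <= 0:
        raise ValueError("thickness and moduli must be positive")
    return 2.0 * math.pi * t * (mu / (3.0 * mu_s)) ** (1.0 / 3.0)


# Hypothetical values, chosen only to show the linear scaling with thickness.
thick = folding_wavelength(t=2.0, mu=1.0, mu_s=1.0)
thin = folding_wavelength(t=1.0, mu=1.0, mu_s=1.0)
print(f"thick layer: {thick:.2f}, thin layer: {thin:.2f}, ratio: {thick / thin:.1f}")
```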

      From this biomechanical framework, our answers to the questions of the Reviewer would be:

      • How is the folding of the cerebellum determined across mammals? By the expansion of a layer of reduced thickness on top of an elastic layer (the white matter).

      • Folial width, folial perimeter, and thickness of the molecular layer increase at a relatively slow rate across the species studied. Does this mean that these parameters have little influence on the cerebellar folding pattern? On the contrary, that indicates that the shape of individual folia is stable, providing the smallest level of granularity of a folding pattern. In the extreme case where all folia had exactly the same size, a small cerebellum would have enough space to accommodate only a few folia, whereas a large cerebellum would accommodate many more.

      • What mostly defines the folding patterns of the cerebellum given the results? Is it the interaction between section length and area? It is mainly the two-dimensional expansion of the cerebellar cortical layer, together with its thickness.

      • Can the authors explain why size does not seem to be a "limiting factor" for the folding of the cerebellum? Because even a cerebellum of very small volume would fold if its cortex were thin enough and expanded sufficiently. This is why the cerebellum folds even though it is smaller than the cerebrum: its cortex is much thinner.

      8) One caveat or point to be raised is the fact that the authors use the median of the variables measured for the whole cerebellum (e.g., median width and median perimeter across all folia). Although the cerebellum is highly uniform in its gross internal morphology and circuitry's organization across most vertebrates, there is evidence showing that the cerebellum may be organized in different functional modules. In that way, different regions or folia of the cerebellum would have different olivo-cortico-nuclear circuitries, forming, each one, a single cerebellar zone. Although it is not completely clear how these modules/zones are organized within the cerebellum, I think the authors could acknowledge this at the end of their discussion, and raise potential ideas for future studies (e.g., analyse folding of the cerebellum within the brain structure - vermis vs lateral cerebellum, for example). I think this would be a good way to emphasize the importance of the results of this study and what are the main questions remaining to be answered. For example, the expansion of the lateral cerebellum in mammals is suggested to be linked with the evolution of vocal learning in different clades (see Smaers et al., 2018). An interesting question would be to understand how foliation within the lateral cerebellum varies across mammalian clades and whether this has something to do with the cellular composition or any other aspect of the microanatomy as well as the evolution of different cognitive skills in mammals.

      We now address this point in a subsection of the discussion which details the implications of our methodological decisions and the limitations of our approach. It is true that the cerebellum is regionally variable. Our measurements of folial width, folial perimeter and molecular layer thickness are local, and we should be able to use them in the future to study regional variation. However, this comes with a number of difficulties. First, it would require sampling the whole cerebellum (and cerebrum) rather than a single section. Second, even if that were possible, it would increase the number of phenotypes beyond the current scope of this study. Our central question, about folding in the cerebellum compared with the cerebrum, is addressed by providing data for a substantial number of mammalian species. As indicated by Reviewer #3, adding more variables makes phylogenetic comparative analyses very difficult because the models to fit become too large.

      Reviewer #2 (Public Review):

      1) The methods section does not address all the numerical methods used to make sense of the different brain metrics.

      We now provide more detailed descriptions of our measurements of foliation, phylogenetic models, analysis of partial correlations, phylogenetic principal components, and allometry. We have added illustrations (to Figs. 3 and 5), examples and references to the relevant literature.

      2) In the results section, it is sometimes difficult for the reader to understand the reason for a sub-analysis and the interpretation of the numerical findings.

      The revised version of our manuscript includes motivations for the different types of analyses, and we have also added a paragraph providing a guide to the structure of our results.

      3) The originality of the article is not sufficiently brought forward:

      a) the novel method to detect the depth of the molecular layer is not contextualized in order to understand the shortcomings of previously-established methods. This prevents the reader from understanding its added value and hinders its potential re-use in further studies.

      The revised version of the manuscript provides additional context which highlights the novelty of our approach, in particular concerning the measurement of folding and the use of phylogenetic comparative models. The limitations of the previous approaches are stated more clearly, and illustrated in Figs. 3 and 5.

      b) The numerous results reported are not sufficiently addressed in the discussion for the reader to get a full grasp of their implications, hindering the clarity of the overall conclusion of the article.

      Following the Reviewer’s advice, we have thoroughly restructured our results and discussion sections.

      Reviewer #3 (Public Review):

      1) The first problem relates to their use of the Ornstein-Uhlenbeck (OU) model: they try fitting three evolutionary models, and conclude that the Ornstein-Uhlenbeck model provides the best fit. However, it has been known for a while that OU models are prone to bias and that the apparent superiority of OU models over Brownian Motion is often an artefact, a problem that increases with smaller sample sizes. (Cooper et al (2016) Biological Journal of the Linnean Society, 2016, 118, 64-77).

      Cooper et al.’s (2016) article “A Cautionary Note on the Use of Ornstein Uhlenbeck Models in Macroevolutionary Studies” suggests that comparing evolutionary models by their likelihood often leads to incorrectly selecting OU over BM, even for data generated by a BM process. However, Grabowski et al. (2023), in their article ‘A Cautionary Note on “A Cautionary Note on the Use of Ornstein Uhlenbeck Models in Macroevolutionary Studies”’, suggest that Cooper et al.’s (2016) claim may be misleading. The work of Clavel et al. (2019) and Clavel and Morlon (2017) shows that the penalised framework implemented in mvMORPH can successfully recover the parameters of a multivariate OU process. To address the Reviewer’s concern more directly, we used simulations to estimate how often we would select an OU model when the correct model was BM, a procedure similar to the one used by Cooper et al. (2016). However, instead of using the likelihood of the fitted models directly, as Cooper et al. (2016) did (which does not control for the number of parameters in the model), we used the Akaike Information Criterion corrected for small sample sizes (AICc). The standard Akaike Information Criterion takes the number of parameters of the model into account, but this is not sufficient when the sample size is small. AICc provides a score which takes both aspects into account: model complexity and sample size. This information has been added to the manuscript:

      “We selected the best fitting model using the Akaike Information Criterion (AIC), corrected for small sample sizes (AICc). AIC takes into account the number of parameters in the model: AIC = −2 log(likelihood) + 2p, where p is the number of parameters. This approximation is insufficient when the sample size is small, in which case an additional correction is required, leading to the corrected AIC: AICc = AIC + (2p² + 2p)/(n − p − 1), where n is the sample size.”
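
      The two criteria in the quoted passage can be written as short functions, which makes it easy to see how the small-sample term penalises an extra parameter (for example, OU compared with BM) more heavily than plain AIC does. This is a minimal sketch, assuming one log-likelihood per fitted model and a simple count of parameters p and observations n; how mvMORPH counts parameters and observations for multivariate models is not assumed here, and the log-likelihoods in the usage lines are hypothetical.

```python
from typing import Dict, Tuple


def aic(log_likelihood: float, p: int) -> float:
    """AIC = -2 log(likelihood) + 2p, with p the number of model parameters."""
    return -2.0 * log_likelihood + 2.0 * p


def aicc(log_likelihood: float, p: int, n: int) -> float:
    """Small-sample corrected AIC: AICc = AIC + (2p^2 + 2p) / (n - p - 1)."""
    if n - p - 1 <= 0:
        raise ValueError("AICc requires n > p + 1")
    return aic(log_likelihood, p) + (2.0 * p**2 + 2.0 * p) / (n - p - 1)


def best_model(fits: Dict[str, Tuple[float, int]], n: int) -> str:
    """Return the name of the model with the lowest AICc.

    fits maps a model name to (log_likelihood, number_of_parameters).
    """
    return min(fits, key=lambda name: aicc(fits[name][0], fits[name][1], n))


# Hypothetical log-likelihoods and parameter counts, for illustration only.
fits = {"BM": (-100.0, 2), "OU": (-99.5, 3)}
for name, (logl, p) in fits.items():
    print(name, round(aic(logl, p), 3), round(aicc(logl, p, n=56), 3))
print("selected:", best_model(fits, n=56))
```

      Here the OU fit has a slightly higher likelihood, but its extra parameter costs more than the likelihood gain, so BM is selected.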

      In 1000 simulations of 9 correlated traits for 56 species (i.e., 56 × 9 data points) on our phylogenetic tree, we selected OU when the true model was BM in only 0.7% of cases.

      2) Second, for the partial correlations (e.g. fig 7) and Principal Components (fig 8) there is a concern about over-fitting: there are 9 variables and only 56 data points (violating the minimal rule of thumb that there should be >10 observations per parameter). Added to this, the inclusion of variables lacks a clear theoretical rationale. The high correlations between most variables will be in part because they are to some extent measuring the same things, e.g. the five different measures of cerebellar anatomy which include two measures of folial size. This makes it difficult to separate their effects. I get that the authors are trying to tease apart different aspects of size, but in practice, I think these results (e.g. the presence of negative coefficients in Fig 7) are really hard or impossible to interpret. The partial correlation network looks like a "correlational salad" rather than a theoretically motivated hypothesis test. It isn't clear to me that the PC analyses solve this problem, but it partly depends on the aims of these analyses, which are not made very clear.

      PCA is simply a rigid rotation of the data: all distances among multivariate data points are preserved. Neither our PCA nor our partial correlation analysis involves model fitting, so the concept of overfitting does not apply. PCA and partial correlations are also not used here for hypothesis testing, but as exploratory methods which provide a transformation of the data aimed at capturing the main trends of multivariate change. The aim of our analysis of correlation structure is precisely to avoid the “correlational salad” that the Reviewer mentions. The Reviewer is correct: all our variables are correlated to varying degrees (note that there are 56 data points per variable, i.e., 56 × 9 data points in total, not just 56). Partial correlations and PCA aim to provide a principled way in which correlated measurements can be explored. In the revised version of the manuscript we include a more detailed description of partial correlations and phylogenetic PCA. Whenever variables measure the same thing, they will be combined into the same principal component (these are the combinations shown in Fig. 8 b and d). Additionally, two variables may be correlated because of their correlation with a third variable (or more). Partial correlations address this possibility by looking at the correlations between the residuals of each pair of variables after all other variables have been covaried out. We provide a simple example which should make this clear, providing in particular an intuition for the meaning of negative correlations:

      “All our phenotypes were strongly correlated. We used partial correlations to better understand pairwise relationships. The partial correlation between 2 vectors of measurements a and b is the correlation between their residuals after the influence of all other measurements has been covaried out. Even if the correlation between a and b is strong and positive, their partial correlation could be 0 or even negative. Consider, for example, 3 vectors of measurements a, b, c, which result from the combination of uncorrelated random vectors x, y, z. Suppose that a = 0.5 x + 0.2 y + 0.1 z, b = 0.5 x - 0.2 y + 0.1 z, and c = x. The measurements a and b will be positively correlated because of the effect of x and z. However, if we compute the residuals of a and b after covarying out the effect of c (i.e., x), their partial correlation will be negative because of the opposite effect of y on a and b. The statistical significance of each partial correlation being different from 0 was estimated using the edge exclusion test introduced by Whittaker (1990).”
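
      The three-variable example in the quoted passage can be reproduced numerically. The sketch below computes ordinary (non-phylogenetic) partial correlations from the inverse of the covariance matrix, which is algebraically equivalent to correlating the residuals of linear regressions on all the other variables; it includes neither the phylogenetic covariance nor Whittaker's (1990) edge exclusion test used in the study. With the stated weights, the population correlation between a and b is 0.22/0.30 ≈ 0.73, while their partial correlation given c is −0.03/0.05 = −0.6, so the simulated values should land close to these.

```python
import numpy as np


def partial_correlations(data: np.ndarray) -> np.ndarray:
    """Partial correlation of every pair of columns, given all other columns.

    data has shape (n_observations, n_variables). Uses the precision matrix
    P = inverse(covariance): pcor_ij = -P_ij / sqrt(P_ii * P_jj).
    """
    precision = np.linalg.inv(np.cov(data, rowvar=False))
    scale = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(scale, scale)
    np.fill_diagonal(pcor, 1.0)
    return pcor


def residual_correlation(u: np.ndarray, v: np.ndarray, control: np.ndarray) -> float:
    """Correlate the residuals of u and v after regressing each on `control`."""
    design = np.column_stack([np.ones_like(control), control])
    res_u = u - design @ np.linalg.lstsq(design, u, rcond=None)[0]
    res_v = v - design @ np.linalg.lstsq(design, v, rcond=None)[0]
    return float(np.corrcoef(res_u, res_v)[0, 1])


rng = np.random.default_rng(0)
x, y, z = rng.standard_normal((3, 100_000))  # uncorrelated source vectors
a = 0.5 * x + 0.2 * y + 0.1 * z
b = 0.5 * x - 0.2 * y + 0.1 * z
c = x

raw = float(np.corrcoef(a, b)[0, 1])
pcor = float(partial_correlations(np.column_stack([a, b, c]))[0, 1])
check = residual_correlation(a, b, c)
print(f"correlation(a, b) = {raw:.3f}")
print(f"partial correlation(a, b | c) = {pcor:.3f} (residual check: {check:.3f})")
```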

      The rationale for our analyses has been made clearer in the revised version of the manuscript, aided by the more detailed description of our methods. In particular, we better explain the reason for our 2 measurements of folial shape, width and perimeter, which measure independent dimensions of folding (this is illustrated in Fig. 3d).

      3) The claim of concerted evolution between cortical and cerebellar values (P 11-12) seems to be based on analyses that exclude body size and brain size. It, therefore, seems possible - or even likely - that all these analyses reveal overall size effects that similarly influence the cortex and cerebellum. When the authors state that they performed a second PC analysis with body and brain size removed "to better understand the patterns of neuroanatomical evolution" it isn't clear to me that is what this achieves. A test would be a model something like [cerebellar measure ~ cortical measure + rest of the brain measure], and this would deal with the problem of 'correlation salad' noted below.

      The answer to this question is in the partial correlation diagram in Fig. 7c. This analysis does not exclude body weight or brain weight. It shows that the strong correlation between cerebellar area and length is supported by a strong positive partial correlation, as is the link between cerebral area and length. There is a significant positive partial correlation between cerebellar section area and cerebral section length. That is, even after covarying out everything else, there is still a correlation between cerebellar section area and cerebral section length (this partial correlation is equivalent to the test suggested by the Reviewer). Additionally, there is a positive partial correlation between body weight and cerebellar section area, but no significant partial correlation between body weight and cerebral section area or length. Our approach aims at obtaining a general view of all the relationships in the data. Testing an individual model would certainly decrease the number of correlations; however, it would provide only a partial view of the problem.

      4) It is not quite clear from fig 6a that the result does indeed support isometry between the data sets (predicted 2/3 slope), and no coefficient confidence intervals are provided.

      We have now added the numerical values of the CIs to all our plots, in addition to the graphical representations (grey regions) already present in the previous version of the manuscript. The isometry slope (0.67) is either within the CIs (for both the linear and the orthogonal regressions) or at their margin, indicating that if the relationships are not isometric, they are very close to it.
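
      As an illustration of how a slope fitted to log10-transformed data can be compared with the isometric expectation of 0.67, the sketch below fits an ordinary least-squares line and an orthogonal (major axis) line and attaches percentile bootstrap confidence intervals. It is a simplification under stated assumptions: it ignores phylogenetic non-independence (which the study's models account for), uses a bootstrap rather than whatever interval method the manuscript used, and runs on synthetic data generated with a slope of 2/3, not on the study's measurements.

```python
import numpy as np


def ols_slope(x: np.ndarray, y: np.ndarray) -> float:
    """Slope of the ordinary least-squares line of y on x."""
    return float(np.polyfit(x, y, 1)[0])


def major_axis_slope(x: np.ndarray, y: np.ndarray) -> float:
    """Slope of the major axis (orthogonal regression): the direction of
    largest variance of the centred (x, y) point cloud."""
    _, eigvecs = np.linalg.eigh(np.cov(x, y))
    direction = eigvecs[:, -1]  # eigh returns eigenvalues in ascending order
    return float(direction[1] / direction[0])


def bootstrap_ci(x, y, slope_fn, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for a slope, resampling (x, y) pairs."""
    rng = np.random.default_rng(seed)
    n = len(x)
    slopes = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)
        slopes[i] = slope_fn(x[idx], y[idx])
    tail = (1.0 - level) / 2.0
    low, high = np.quantile(slopes, [tail, 1.0 - tail])
    return float(low), float(high)


# Synthetic log10 values for 56 hypothetical species, generated with slope 2/3.
rng = np.random.default_rng(1)
log_x = rng.uniform(0.0, 3.0, size=56)
log_y = (2.0 / 3.0) * log_x + rng.normal(0.0, 0.05, size=56)

for name, fn in [("OLS", ols_slope), ("major axis", major_axis_slope)]:
    slope = fn(log_x, log_y)
    low, high = bootstrap_ci(log_x, log_y, fn)
    print(f"{name}: slope {slope:.3f}, 95% CI [{low:.3f}, {high:.3f}], "
          f"isometry inside CI: {low <= 2 / 3 <= high}")
```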

      Referencing/discussion/attribution of previous findings

      5) With respect to the discussion of the relationship between cerebellar architecture and function, and given the emphasis here on correlated evolution with cortex, Ramnani's excellent review paper goes into the issues in considerable detail, which may also help the authors develop their own discussion: Ramnani (2006) The primate cortico-cerebellar system: anatomy and function. Nature Reviews Neuroscience 7, 511-522 (2006)

      We have added references to the work of Ramnani.

      6) The result that humans are outliers with a more folded cerebellum than expected is interesting and adds to recent findings highlighting evolutionary changes in the hominin human cerebellum, cerebellar genes, and epigenetics. Whilst Sereno et al (2020) are cited, it would be good to explain that they found that the human cerebellum has 80% of the surface area of the cortex.

      We have added this information to the introduction:

      “In humans, the cerebellum has ~80% of the surface area of the cerebral cortex (Sereno et al. 2020), and contains ~80% of all brain neurons, although it represents only ~10% of the brain mass (Azevedo et al. 2009)”

      7) It would surely also be relevant to highlight some of the molecular work here, such as Harrison & Montgomery (2017). Genetics of Cerebellar and Neocortical Expansion in Anthropoid Primates: A Comparative Approach. Brain Behav Evol. 2017;89(4):274-285. doi: 10.1159/000477432. Epub 2017 (especially since this paper looks at both cerebellar and cortical genes); also Guevara et al (2021) Comparative analysis reveals distinctive epigenetic features of the human cerebellum. PLoS Genet 17(5): e1009506. https://doi.org/10.1371/journal.pgen.1009506. Also relevant here is the complex folding anatomy of the dentate nucleus, which is the largest structure linking cerebellum to cortex: see Sultan et al (2010) The human dentate nucleus: a complex shape untangled. Neuroscience. 2010 Jun 2;167(4):965-8. doi: 10.1016/j.neuroscience.2010.03.007.

      This information is certainly important, and could have provided a wider perspective on cerebellar evolution, but we prefer to keep our focus on cerebellar anatomy and address genetics only indirectly through phylogeny.

      8) The authors state that results confirm previous findings of a strong relationship between cerebellum and cortex (P 3 and p 16): the earliest reference given is Herculano-Houzel (2010), but this pattern was discovered ten years earlier (Barton & Harvey 2000 Nature 405, 1055-1058. https://doi.org/10.1038/35016580; Fig 1 in Barton 2002 Nature 415, 134-135 (2002). https://doi.org/10.1038/415134a) and elaborated by Whiting & Barton (2003) whose study explored in more detail the relationship between anatomical connections and correlated evolution within the cortico-cerebellar system (this paper is cited later, but only with reference to suggestions about the importance of functions of the cerebellum in the context of conservative structure, which is not its main point). In fact, Herculano-Houzel's analysis, whilst being the first to examine the question in terms of numbers of neurons, was inconclusive on that issue as it did not control for overall size or rest of the brain (A subsequent analysis using her data did, and confirmed the partially correlated evolution - Barton 2012, Philos Trans R Soc Lond B Biol Sci. 367:2097-107. doi: 10.1098/rstb.2012.0112.)

      We apologise for this oversight; these references are now included.

    1. Author Response

      Reviewer #1 (Public Review):

      This study focuses on elucidating the function of CD59, a small GPI-anchored glycoprotein, in Schwann cell development. Patients with CD59 deficiency suffer from neurological dysfunctions, but the link between CD59 deficiency and the development of neurological dysfunctions remains unclear. To clarify this link, the authors used zebrafish as an animal model. They generated cd59 mutant zebrafish and studied their Schwann cell development. The authors started this study by showing CD59 expression data from different sources in the Schwann cell and oligodendrocyte lineages in zebrafish and mice. They continued by demonstrating that CD59 is expressed only by a subset of developing Schwann cells, which is very interesting conceptually for the identification of different Schwann cell populations and their specific functions and also for the potential development of future techniques targeting specific Schwann cell populations. However, since the authors focused in the following parts of the article on Schwann cell development, it is unclear why they have included data on oligodendrocytes at the start of the manuscript.

      Thank you for this question. We included the data on oligodendrocytes because we wanted to be thorough and transparent. Additionally, because some of our own expression data show oligodendrocyte expression, we felt it was prudent to confirm this expression in published RNAseq datasets. Finally, we created and/or used tools to label cd59-positive cells, and we often used expression in both oligodendrocytes and Schwann cells as a readout of complete expression of these tools.

      In this study, the authors show that cd59 ablation in zebrafish leads to increased Schwann cell proliferation between 48 and 55 hpf (hours post fertilization), which is quite convincing. However, they claim that this transient increase in proliferation leads to impaired myelination and node of Ranvier formation. Unfortunately, these findings remain correlative and it appears unclear why an increased number of Schwann cells that stop proliferating at the same time-point as wild type Schwann cells would impair myelination and node of Ranvier formation. This phenotype is attributed by the authors to increased proliferation of Schwann cells between 48 and 55 hpf, which seems rather unlikely or not supported by the data currently presented. The hypomyelination phenotype is rather mild, while the impairment of node of Ranvier formation seems quite strong - however, the data currently presented is not very convincing and needs improvement.

      Thank you for your observations. With regard to how an increase in SC proliferation could impact myelination and node of Ranvier formation: although the rate of proliferation increases only transiently, the excess SCs persist on the nerve. So, even though the mutants stop developmental proliferation at the same stage as wildtype larvae, they ultimately have more SCs on the nerve after proliferation has ceased. This raises the interesting question of how more SCs could lead to less myelin. To address this question, we expanded the discussion to speculate on possible hypotheses (please see line 510).

      With regards to comparing the strength of the myelin phenotype and the node of Ranvier phenotype, there is no reason to suspect that there is a linear relationship between myelin volume and node of Ranvier assembly. We do know that myelination and SCs are necessary for node of Ranvier assembly. So, it is very possible that any perturbation in myelination could drastically affect node of Ranvier assembly. That said, this relationship is very interesting, and we hope that the cd59 mutant model can be utilized to further investigate these questions in future studies.

      With regard to the node of Ranvier data, we have provided co-labeling of NF186 and NaV channels on mbpa:tagrfp-caax-positive nerves (see Figure 5 – figure supplement 1D). Using Imaris, we demonstrate that each NF186 cluster colocalizes with a NaV channel cluster. Furthermore, this colocalization only occurs within the myelinated nerve. Collectively, these data demonstrate that our quantification of nodes of Ranvier is reliable.

      The data showing an increase of complement activation in cd59 mutants is also not very convincing and should be improved.

      Thank you for sharing your concern. To address this issue, we have used Imaris to show MACs that are bound to SC membranes (see Figure 6B) for a clearer view of the data. Comparing wildtype and mutant larvae, there is a visible and significant increase in MAC binding to SC membranes when cd59 is perturbed. Additionally, we have included controls for these antibody labeling experiments to show specificity of these tools.

      In addition, the link between increased complement activation and increased proliferation remains to be proven in the context of this study, and the choice of dexamethasone as an inhibitor of complement activation does not appear to be the best choice since it is not specific to the complement.

      Thank you for sharing your concern. We agree that dexamethasone impacts aspects of immune activation other than complement. With this in mind, we did test another drug called compstatin, which inhibits complement protein 3 (C3). Inhibition of C3 impairs all three complement pathways and would abrogate downstream assembly of MACs. Our preliminary data were very promising, demonstrating the same relationship that we see with our dexamethasone treatment (see below). However, we were unable to reproduce these results in subsequent experiments after we purchased a new stock of this drug. To solve this problem, we tried compstatin from a different company and also increased the concentration, but none of our troubleshooting efforts yielded the results that we had originally observed. Obviously, this is incredibly disappointing to us. So, although these results initially replicated, we did not feel it was ethical to publish these data. (In the figure below, wildtype and mutant embryos were treated with 1% DMSO or 50 µM compstatin in 1% DMSO from 24 hpf to 55 hpf. The number of SCs was quantified with a Sox10 antibody and confocal imaging at 55 hpf.)

      Given these technical limitations, we ultimately decided to include the dexamethasone experiments because they were reproducible. Considering the broader effects of dexamethasone on the immune system, we have softened our claims to include inflammation as well as complement activation. That said, we hope future studies will be able to use this model to gather more information on the specific pathways that are regulating Cd59-dependent SC proliferation.

      Page 49, lines 437-439: Here the authors claim that their data "demonstrates that developmental inflammation aids in normal SC proliferation and that this process is amplified when cd59 is mutated." The data presented in Figure 6C-D and commented by the authors on page 49, lines 435-437, show however that "Dex treatment in cd59uva48 mutant embryos restored SC numbers to wildtype levels, whereas wildtype SCs were not significantly affected by Dex application". Dex (dexamethasone) was used here to inhibit inflammation and associated complement activation. Therefore, these data do not show that developmental inflammation aids in normal SC proliferation, but rather that it has no influence.

      Thank you for your comment. When compared alone, there are significantly fewer SCs in dexamethasone-treated wildtype larvae compared to DMSO-treated wildtype larvae. We have updated the figure and text to better highlight this relationship (please see Figure 7A, C and line 457). We also quantified EdU incorporation into SCs treated with dexamethasone. Here we also observed a decrease in EdU-positive SCs in wildtype larvae treated with dexamethasone, supporting our observation that developmental inflammation is contributing to normal SC proliferation (please see Figure 7B, D).

      Dexamethasone treatment: The authors claim that dexamethasone treatment, by decreasing inflammation and associated complement activation, leads to a decrease of SC proliferation in the cd59 mutant. To support this, there is only Figure 7-Figure Supplement 1 showing a decreased SC number in the mutant treated by dexamethasone as compared to vehicle-treated mutant. To strengthen this point, the authors also need to specifically quantify proliferation by EdU incorporation, as they did in Figure 4, and also cell death.

      Thank you for your comment. We have added quantification of EdU incorporation after dex treatment (please see Figure 7B, D). Dr. Feltri, the Reviewing Editor, told us that measuring apoptosis after treatment was not necessary for the revision.

      In addition, the mechanistic hypothesis of increased proliferation in cd59 mutant is that cd59 interferes with the activation of the complement and complement-induced pore formation in the plasma membrane. However, dexamethasone is not a specific inhibitor of the complement. Therefore, its potential effect on SC proliferation could be due to other effects than complement inactivation. It is unclear why the authors did not use an inhibitor of the complement that is more specific than dexamethasone.

      Thank you for your comment. Please see our previous response to this comment.

      Page 54, lines 456-457: The following statement "Collectively, these data demonstrate that inflammation-induced SC proliferation contributes to perturbed myelin and node of Ranvier development." is not accurate since these data remain correlative. Indeed, there is in this study nothing showing that increased SC proliferation between 48 and 55 hpf leads to perturbed myelin and node of Ranvier development. In addition, the term "inflammation" is not precise enough here. What the authors attempt to show is an increase of complement activation due to the absence of cd59 expression in SCs. The authors did not try to induce inflammation in wild type animals to see whether this induces proliferation and perturbed myelin and node of Ranvier development. They also did not try to directly knock down C8/C9 in cd59 mutants to see whether they would rescue the phenotype of the cd59 mutant, at least to some extent. In addition, their statement mentioned above needs to be more precise by stating that their findings apply to cd59 mutants and not to wild type animals.

      Thank you for your comment. Please see our previous responses to these comments.

      Reviewer #3 (Public Review):

      Wiltbank and colleagues explore the function of CD59 in developmental Schwann cell myelination. Using previously published transcriptomics data sets they arrive at CD59 as a differentially expressed gene in myelinating glia. In addition, patients with pathogenic variants have neuropathy. The authors construct a transgenic zebrafish reporter line for cd59. Surprisingly, it labels a very, very small percentage of Schwann cells (less than 10% throughout development). The authors then construct several loss-of-function mutants for cd59. They report these mutants have increased numbers of Schwann cells, but nerves are smaller and EM shows a reduced number of myelin wraps. Consistent with impaired myelination they also observe fewer nodes of Ranvier. The authors suggest loss of cd59 results in increased MAC deposition on myelinating Schwann cells. Remarkably, using an inhibitor of inflammation (dexamethasone), the authors show that they can normalize/rescue the main phenotypes: 1) normalize the number of SCs, 2) dramatically improve myelination to normal nerve volumes, and 3) rescue node of Ranvier formation. This last experiment that rescues the phenotype is really terrific. The experiments are mostly very well done and the story is both interesting and conceptually novel. Nevertheless, there are a few points that I think the authors could address:

      1) It is very surprising that the cd59 reporter line only showed expression in a small subset (10% or less) of Schwann cells. How do the authors explain the widespread effects? Similarly, the authors make a point of stating that motor Schwann cells did not express cd59. Did myelinated motor axons show the same phenotype - reduced myelination, impaired node formation? How can the expression of cd59 in only 10% of cells cause widespread effects throughout the nerve? How can it limit overproliferation if 9/10 cells don't even express it?

      Thank you for the question. It is really interesting that a small subset of cells can have such a big impact on nerve development. One of our current hypotheses is that overproliferation of SCs has led to activation of contact-inhibition pathways, which in turn are negatively regulating myelination. We expand further on this hypothesis in our discussion (please see line 536). We also suggest questions addressing glial cell heterogeneity to explore in the future (please see line 603).

      Regarding the motor nerves, we quantified the number of Sox10-positive cells (SCs and MEP glia) on motor nerves at 72 hpf and showed that there was no overproliferation of these cell types (please see Figure 4I, J). We have not observed any issues with motor nerve myelination, which is what we would expect if motor SC proliferation was unaffected. That said, these differences between motor and pLLN SCs are really interesting because they open up discussion of glial cell heterogeneity between nerve types (e.g. sensory versus motor nerves). We see similar evidence of this in satellite glia that populate the cochlear spiral ganglion versus those that surround the dorsal root ganglia (see Tasdemir-Yilmaz et al. 2021 or Wiltbank and Kucenas 2021), so it makes sense that there could be some SC diversity between nerve types as well. We expand further on these ideas in our discussion (please see line 603).

      2) It is surprising to me that there is a significant increase in SC proliferation, but no change in the length of myelin sheaths. Does this mean there are more SCs that remain unmyelinated and undifferentiated?

      Thank you for your comment. We were also surprised. With our current tools, we are unable to determine the fate of these extra SCs but hope that future studies will be able to clarify this question. We have added discussion around this topic to the text (please see line 554).

      3) The results showing deposition of the MAC (via C5b-9+C5b-8 immunostaining) are not convincing. The overall background level of immunostaining is dramatically increased. This result is central to the overall story in the paper. What controls were performed to confirm this doesn't simply reflect an overall higher background artifact during immunostaining?

      Thank you for your comment. We have added our antibody controls to the supplemental figures (please see Figure 6 – figure supplement 1A) demonstrating that we can increase MAC deposition by inducing complement activation (either through heat-related damage or DNase-elicited DNA damage). We also do not observe signal when the primary antibody is not present. Based on our controls, we do not think the extra MAC labeling is background. Rather, we believe that MAC deposition has increased globally in the cd59 mutant embryos. This is not surprising given that complement activation leads to a positive feedback loop of more complement and immune activation, which is likely occurring in the cd59 mutants.

      To help clarify the MAC data, we have also added Imaris renderings of the MACs that are bound to the SC membranes, demonstrating that there are more MACs embedded in the cd59 mutant SC membranes compared to wildtype SCs (see Figure 6B).

      4) Can the authors speculate on a mechanism for how promoting more MAC results in increased proliferation?

      Thank you for your question. We have added discussion around this topic to the text (please see line 585).

    1. Author Response:

      Reviewer #1:

      Weaknesses:

      For me, most of the weaknesses of this manuscript are related to the cluster detection:

      1. There is no consensus on the definition of transmission clusters in the field. However, the rationale for taking the union (rather than the intersection) of two different methods (HIV-TRACE and Cluster Picker) did not become clear to me.

      2. HIV-TRACE defines clusters based on pairwise genetic distances and cluster picker identifies clusters using pairwise genetic distance with the guidance of a phylogenetic tree (and node support / bootstrap values). Given the underlying sample size and that the phylogeny was constructed already, the rationale for the purely distance related criterion of HIV-TRACE did not become clear.

      We thank the reviewer for their comments and are happy to provide additional results that motivate our decision to use the union of clusters detected with HIV-TRACE and Cluster Picker to estimate HIV transmissions within and between demographic sub-groups in the Botswana - Ya Tsie trial population. The primary motivation was that a filtering step was required to save time and computational resources from evaluating sequences that were too distantly related, before applying the “gold standard” of Phyloscanner to detect directed (when possible) transmission pairs. Accordingly, clustering algorithms plus a distance threshold helped to achieve this filtering. Because we shared what we take to be the reviewers’ concerns about either of the algorithms alone, we sought to maximize the number of transmission pairs that could be identified between participants in the Botswana – Ya Tsie trial with Phyloscanner by using the union of clusters detected with HIV-TRACE and Cluster Picker. This also served as a sensitivity analysis that allowed us to evaluate the extent to which the clustering patterns observed were specific to a single algorithm.

      Furthermore, a previous study done by Rose and colleagues (PMID: 27824249) to compare the number and size of clusters identified with HIV-TRACE and Cluster Picker clustering algorithms revealed that HIV-TRACE generally identified larger but fewer clusters, compared with clusters identified with Cluster Picker that were typically more numerous and mostly small 2-person clusters (Please see Figure 3B below extracted from Rose and colleagues (PMID: 27824249)). This suggested that HIV-TRACE would be helpful in detecting potentially larger transmission chains and Cluster Picker would be valuable in revealing potential transmission events between pairs of individuals.

      Of the 236 genetic clusters detected with the two algorithms, we identified 19 full or partial clusters (including 41 sequences) that included members that were only detected with HIV-TRACE and 122 full or partial clusters (including 242 sequences) that were unique to Cluster Picker. Moreover, of the 82 directed male-female transmission pairs inferred from the sample, five were from genetic clusters that were unique to HIV-TRACE, compared with 27 from clusters unique to Cluster Picker. Of the five transmission events unique to HIV-TRACE clusters, three occurred in intervention communities and originated from control communities. By contrast, four of the 27 transmission events unique to Cluster Picker clusters occurred in intervention communities and originated from control communities.

      In summary, estimates of HIV transmissions in the trial population based only on the overlap (intersection) of clusters detected with HIV-TRACE and Cluster Picker would have excluded 32 (5 + 27) of the 82 male-female pairs used for the primary analysis.
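      The union-versus-intersection arithmetic above can be illustrated with plain Python sets. This is a minimal sketch, not the study pipeline: the pair IDs are hypothetical integers, and it assumes each of the 82 pairs can be labeled by whether its cluster was detected by HIV-TRACE only (5), by Cluster Picker only (27), or by both (the remaining 50).

```python
def build_pair_sets(n_total, n_trace_only, n_picker_only):
    """Return (trace, picker) sets of hypothetical pair IDs 0..n_total-1.

    Assumption for illustration: the first n_trace_only IDs come only from
    HIV-TRACE clusters, the next n_picker_only only from Cluster Picker
    clusters, and all remaining IDs from clusters detected by both.
    """
    ids = list(range(n_total))
    trace_only = set(ids[:n_trace_only])
    picker_only = set(ids[n_trace_only:n_trace_only + n_picker_only])
    shared = set(ids[n_trace_only + n_picker_only:])
    return trace_only | shared, picker_only | shared


# Counts taken from the response above.
trace, picker = build_pair_sets(82, 5, 27)

union = trace | picker          # pairs kept when either algorithm detects the cluster
intersection = trace & picker   # pairs kept only when both algorithms detect it
excluded = union - intersection # pairs lost by requiring full overlap

print(len(union), len(intersection), len(excluded))  # 82 50 32
```

      Taking the union keeps every pair that either algorithm places in a cluster, which is the more permissive choice for a pre-filtering step ahead of Phyloscanner.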

      3. For a phylogeny of this size it is feasible to calculate real bootstrap values instead of using (in my experience more liberal) Shimodaira-Hasegawa support values.

      We value the reviewer's suggestion and agree that real bootstrap values could be ideal. However, the likely benefit of computing the suggested bootstrap values and then repeating the entire analysis (inferring transmission pairs with Phyloscanner and estimating transmission flows) would be minimal. As noted above, liberality in a filtering step is a virtue (it avoids filtering out pairs of interest) as long as it does not lead to an unfeasibly large computational burden, which it did not here.

      4. In Supplementary Note 2.5 it is described how the linkage and direction of transmission score threshold of 57% was chosen. However, the finding that almost half of the accordingly selected probable source-recipient pairs were same-sex and had to be excluded from the analysis calls into question the reliability of the threshold.

      We apologize for the insufficient clarity in our description. We would like the reviewer to note that the threshold in and of itself cannot distinguish Female-Female pairs separated by a single Male intermediate; rather, by design it distinguishes direct Male-Female pairs from Male-Female pairs separated by several intermediates. Again, the threshold was meant to be a filter that would allow us to run Phyloscanner on a feasible number of sequences, and it should therefore let through some pairs that are rejected by later steps in the pipeline. Also, please note that all previous Supplementary Notes are now presented in the methods section, in line with the reviewer's suggestions.

    1. Author Response:

      Reviewer #1:

      The authors present an interesting concept for the mechanism of rash induction in EGFR inhibitor (EGFRi) treated rats. EGFRi causes production of pro-inflammatory factors in epidermal keratinocytes which may induce dedifferentiation and reduction of the dWAT compartment, presumably mediated via PPAR. Factors produced by dedifferentiated FB then recruit monocytes thereby inducing skin inflammation. This work is aiming to improve targeted cancer therapy efficiency and is therefore of potential clinical relevance.

      However, most of the conclusions drawn by the authors are based on correlations, e.g. between the amount of dWAT and rash intensity. Mechanistic data have been mainly generated in vitro. The exact order of events to formulate a definitive mechanistic proof in vivo for this hypothesis is missing. In particular, it is not clear which cells in the skin, apart from keratinocytes, are specifically targeted by EGFR inhibitors and/or by Rosiglitazone. The authors also do not show EGFR staining in adipocytes and its inhibition by Afa. The effects of Afa and Rosi on monocytes / macrophages are completely ignored by the authors. Additionally, some of the presented results are overinterpreted and not really supporting what is claimed.

      Most importantly, the whole study is based on inhibitor treatments. Afatinib, for example, inhibits not only EGFR but all other ErbB family members; as a pan-ErbB inhibitor, it is not clear whether the observed effects are induced by inhibition of EGFR or of other ErbB receptors, which have also been shown to have effects in the skin. To further specify the role of EGFR, other, more specific inhibitors should be used to confirm the basic concept, along with genetic proof either in genetically engineered mice or by CRISPR-mediated deletion.

      To further support the hypotheses of the authors, the study needs to be further substantiated by mechanistic experiments and the clinical relevance should be strengthened by performing histologic analysis of skin samples of patients treated with EGFRi and respective analysis of rash and e.g. BMI etc.

      Thank you for your positive comments on the potential impact for cancer patients suffering from EGFR inhibitor-induced skin rash. We have carefully considered all comments from the reviewer and revised our manuscript accordingly. In the following section, we summarize our responses to each comment. We believe that our responses address the reviewer's concerns.

      We agree with the reviewer’s comment that our research may need more direct mechanistic in vivo studies upon our in vitro results. In our research, we have collected evidence from previous studies and used various in vitro and ex vivo experiments to investigate our findings. However, the study was still limited by currently available technologies.

      In the revised version, we supplemented the pEGFR and pERK staining of adipocytes in Figure 3-figure supplement 1C. The levels of phospho-EGFR and ERK in dWAT were significantly decreased after EGFRi treatment.

      This study was inspired by the observation of an unusual dWAT reduction during EGFRi treatment, so we focused on investigating dermal adipocytes. In addition, mastocytes, monocytes, and macrophages in EGFRi-induced cutaneous toxicity have been thought of as responders to increased cytokine expression. Local depletion of macrophages and degranulated mastocytes provided only partial resolution, indicating a multifactorial and complicated pathology of cutaneous toxicity induced by anti-EGFR therapy (Lichtenberger et al., 2013; Mascia et al., 2013).

      Regarding some inappropriate descriptions, we agree with the reviewer that they would be more convincing with direct assessment in genetically engineered mice. For example, we tried to establish the relationship between S. aureus infection and EGFRi-induced rash based on a well-accepted study from Lingjuan Zhang (Zhang et al., 2015). They reported that adipose precursor cells secrete the antimicrobial peptide cathelicidin during differentiation to protect against S. aureus infection, and that mice with impaired adipogenesis were more susceptible to S. aureus infection. This conclusion gave us insight into the relationship between S. aureus infection and EGFRi-induced skin inflammation. Unfortunately, the anti-CAMP antibody was made by the authors' lab, and no commercial product can recognize CAMP in rats. To provide more mechanistic evidence, we conducted qPCR experiments to study the transcriptional level of the Camp gene in both dWAT and dFB cells isolated from rat skin (Figure 3I and 3J). dWAT in the Afa group showed lower Camp expression than the control group. In addition, across different differentiation stages of dFB in vitro, Camp transcription was decreased by Afa treatment and increased by Rosi. In summary, the data we collected support, to some extent, a causal relationship between EGFRi-induced dWAT reduction and S. aureus infection. However, technical limitations prevent us from providing more evidence. Thus, in the revised manuscript, we have toned down this statement.

      According to clinical evidence, the rash can also be induced by many specific ErbB1 inhibitors. All three generations of EGFR inhibitors in the clinic have very high incidence rates of cutaneous toxicity (Supplementary file 1). In the revised version, we provide rash models induced by the first-generation EGFRi Erlotinib and Gefitinib and by the third-generation EGFRi Osimertinib. As shown in Figure 1-figure supplement 1D, the rash caused by Erlotinib, Gefitinib, and Osimertinib had the same phenotypes as Afatinib-induced rash.

      In summary, the current evidence supports our findings, although more direct mechanistic studies would be preferable. We are now seeking collaborators to build a dermal adipocyte knockout mouse model platform and hope to investigate the specific roles of dermal adipocytes in the future. We also plan to collaborate with hospitals to explore clinical evidence from patients receiving EGFR inhibitors.

      References:

      Lichtenberger BM, Gerber PA, Holcmann M, Buhren BA, Amberg N, Smolle V, Schrumpf H, Boelke E, Ansari P, Mackenzie C, Wollenberg A, Kislat A, Fischer JW, Röck K, Harder J, Schröder JM, Homey B, Sibilia M. 2013. Epidermal EGFR controls cutaneous host defense and prevents inflammation. Sci Transl Med 5.

      Mascia F, Lam G, Keith C, Garber C, Steinberg SM, Kohn E, Yuspa SH. 2013. Genetic ablation of epidermal EGFR reveals the dynamic origin of adverse effects of anti-EGFR therapy. Sci Transl Med 5.

      Zhang L, Guerrero-juarez CF, Hata T, Bapat SP, Ramos R, Plikus M V, Gallo RL. 2015. Dermal adipocytes protect against invasive Staphylococcus aureus skin infection. Science 347:67–72.

      Reviewer #2:

      Leying Chen et al. investigated the mechanism of EGFR inhibitor-induced rash. They find that atrophy of dermal white adipose tissue (dWAT), a highly plastic adipose tissue with various skin-specific functions, correlates with rash occurrence and exacerbation in a murine model. The data indicate that EGFR inhibition induces dedifferentiation of dWAT and lipolysis, which ultimately lead to dWAT reduction, a hallmark of the pathophysiology of rash. Notably, they demonstrate that stimulating dermal adipocyte expansion with a high-fat diet (HFD) or the pharmacological PPARγ agonist rosiglitazone (Rosi) ameliorated the severity of rash. Therefore, PPARγ agonists may represent a promising new therapeutic strategy for EGFRI-related skin disorders, pending confirmation in further studies.

      We greatly appreciate the reviewer for giving the above positive comments.

      The conclusions of this paper are mostly well supported by data, but some results need to be clarified and verified.

      1) PPAR signaling in the pathology of EGFRI-induced skin toxicity. In Figure 2, the results show that Rosi reversed the dedifferentiation of dermal adipocytes induced by Afa. This may be due to PPARγ upregulation, but this was not confirmed in the results. The expression of the relevant genes in dWAT after treatment with Afa and Rosi was not shown.

      We thank the reviewer for suggesting an additional PPARγ experiment. In the revised version, we collected attached dWAT after 5-day Afa or Rosi treatment and measured Pparg transcription. Pparg expression was downregulated by Afa treatment and upregulated by Rosi treatment (Figure 2-figure supplement 1D).

      2) The effect of PPAR signaling on the PDGFRA-PI3K-AKT pathway. The AKT pathway is a key downstream target of EGFR kinase, so it is reasonable that p-AKT1 and p-AKT2 levels were decreased by Afa (Figure 3C). However, adding Rosi to Afa significantly activated both AKT1 and AKT2. What is the underlying mechanism for this result, and is it related to the PPAR signaling pathway?

      Given the importance of the PI3K/AKT pathway in regulating AP and mature adipocyte biology (Jeffery et al., 2015), we used p-AKT to characterize the activation of dFBs. How modulating PPARγ affects AKT is still unknown. One study found that MAPK and PI3K are upregulated and activated by rosiglitazone, which in turn might enhance adipogenesis (Fayyad et al., 2019). In skeletal muscle, PPARγ enhances insulin-stimulated PI3K and Akt activation (Marx et al., 2004). Rosiglitazone has also been reported to have a neuroprotective effect against oxidative stress: the PPARγ-rosiglitazone complex binds to the neurotrophic factor-α1 (NF-α1) promoter and activates transcription of NF-α1 mRNA, which is then translated to protein; NF-α1 binds to a cognate receptor and activates the AKT and ERK pathways (Thouennon et al., 2015). Further studies are therefore needed to investigate the effects of rosiglitazone on the PI3K/AKT pathway during adipogenesis.

      3) Based on Figure 3F, 3G, and 3H, the authors conclude that "a lack of APs and mature dWAT impairs the maintenance of the host defense and hair growth in the skin". In my opinion, no results directly prove this. According to Figure 3H, the impairment of hair growth may be caused by EGFR inhibition in hair follicles.

      We thank the reviewer for pointing out this important issue. We tried to establish the relationship between S. aureus infection and EGFRI-induced rash based on a well-accepted study from Lingjuan Zhang (Zhang et al., 2015). They reported that adipose precursor cells secrete the antimicrobial peptide cathelicidin during differentiation to protect against S. aureus infection, and that mice with impaired adipogenesis were more susceptible to S. aureus infection. This conclusion gave us insight into the relationship between S. aureus infection and EGFRI-induced skin inflammation. Unfortunately, the anti-CAMP antibody was made by the authors' lab, and no commercial product can recognize CAMP in rats. To provide more mechanistic evidence, we conducted qPCR experiments to study the transcriptional level of the Camp gene in both dWAT and dFB cells isolated from rat skin (Figure 3I and 3J). dWAT in the Afa group showed lower Camp expression than the control group. In addition, across different differentiation stages of dFB in vitro, Camp transcription was decreased by Afa treatment and increased by Rosi. In summary, the data we could collect with current technology support, to some extent, a causal relationship between EGFRI-induced dWAT reduction and S. aureus infection. However, we agree with the reviewer that this conclusion needs more direct evidence. Thus, in the revised manuscript, we have toned down this statement.

      Since recent reports have shown that dermal adipocytes can support hair regeneration, we used this finding to characterize the function of dWAT. However, we agree with the reviewer that more specific and direct experiments are needed to verify a causal role of dWAT. We are seeking collaborators to build a dermal adipocyte knockout mouse model platform and hope to investigate the specific roles of dermal adipocytes in the future. In the revised manuscript, we have also adjusted these statements.

      4) EGFRI stimulates keratinocytes (HaCaT cells) to produce lipolytic cytokines (IL-6) (Figure 4G). IL-6 enhanced the lipolysis of differentiated dFB (Figure S4M), and C18 fatty acids were presumed to be released into the cell matrix during lipolysis. In Figure 4H, HaCaT cell supernatants and dFB supernatants were collected. IL-6 was expected to increase in HaCaT cell supernatants, and this was confirmed in Figures S4K and S4L. However, C18 fatty acids were not directly shown in the dFB supernatants in this study.

      We thank the reviewer for pointing this out. We conducted additional lipidomics of dFB supernatants. However, because the differentiation medium needs to be changed every two days, it is hard to accumulate enough FFAs. We collected supernatants on Day 3, Day 6, and Day 9, and all were below the detection limit of mass spectrometry. We agree with the reviewer that more evidence is needed to establish the correlation between C18 FFAs and lipolysis. Therefore, we performed mass spectrometry analysis of skin tissues from the Ctrl and Afa groups after 3-day treatment to confirm the release of C18 FFAs. The results showed a trend toward increased C18:2 and other FFAs in the Afa group (Figure 1 in response letter). However, this increase was not statistically significant, which might be due to interference from sebaceous glands and dermal adipocytes. Consequently, we have toned down this statement in the revised manuscript.

      Figure 1. C18 concentrations in skin tissues from Ctrl and Afa groups after 3-day treatment. n=3.

      References:

      Fayyad AM, Khan AA, Abdallah SH, Alomran SS, Bajou K, Khattak MNK. 2019. Rosiglitazone Enhances Browning Adipocytes in Association with MAPK and PI3-K Pathways During the Differentiation of Telomerase-Transformed Mesenchymal Stromal Cells into Adipocytes. Int J Mol Sci 20.

      Jeffery E, Church CD, Holtrup B, Colman L, Rodeheffer MS. 2015. Rapid depot-specific activation of adipocyte precursor cells at the onset of obesity. Nat Cell Biol 17:376–385.

      Marx N, Duez H, Fruchart J-C, Staels B. 2004. Peroxisome proliferator-activated receptors and atherogenesis: regulators of gene expression in vascular cells. Circ Res 94:1168–1178.

      Thouennon E, Cheng Y, Falahatian V, Cawley NX, Loh YP. 2015. Rosiglitazone-activated PPARγ induces neurotrophic factor-α1 transcription contributing to neuroprotection. J Neurochem 134:463–470.

      Zhang L, Guerrero-juarez CF, Hata T, Bapat SP, Ramos R, Plikus M V, Gallo RL. 2015. Dermal adipocytes protect against invasive Staphylococcus aureus skin infection. Science 347:67–72.

    1. Author Response:

      Reviewer 2 (Public Review):

      Weaknesses

      1. I had difficulty following the ANOVA results for Figure 1. I assume ANOVA was performed with factors of session and block. However, a single F statistic is reported, and I do not know what it refers to. It would be more appropriate either to perform repeated measures ANOVA with session and block as factors for each dependent variable or, even better, a multivariate analysis of variance on the three dependent measures concurrently, and then report the univariate ANOVA results for each dependent measure. The graphs in Figure 1 (C-E) suggest a main effect of block, but as reported, I cannot tell if this is the case. Further, why was sex not included as an ANOVA factor?

      For the sake of transparency, we had included plots showing sessions split by each block, whereas the statistics related to the right-side bar plots, in which data are collapsed across risk (done to minimize effects of 'missing' data). We appreciate that this may have caused confusion. In the revised manuscript we specify the exact figure for each statistical result, have added a better description in the methods, and have updated the statistics (Table 1) with the ANOVA and post-hoc results.

      Previously we had used a mixed effects model because one subject did not complete any risk trials in session 3, but in the revised manuscript we decided to remove that subject's sessions to permit RM ANOVA. As requested by the reviewer, we performed a multivariate analysis on risk and no-risk trials. Because of the repeated measures design, we used the multRM function of the MANOVA.RM package developed by Friedrich et al. (2019), which permits multivariate analysis of repeated measures data with minimal assumptions and bootstrapped p-values for small sample sizes. We found significant interactions for session (or treatment) and risk (see tables below). This justifies the two-way univariate ANOVA that is now reported in the rest of the manuscript. Sex was not included in the ANOVA because the study was not intended to assess sex differences but, rather, was designed according to NIH requirements for inclusion of males and females.

      Note: The MATS statistic was used, following the recommendation of Friedrich et al. (2019) for cases in which the test violates the non-singularity assumption, and it is the statistic reported. To replicate, use random seed 8675309.

      Package link: https://rdrr.io/github/smn74/MANOVA.RM/man/multRM.html

      Publication: Friedrich, S., Konietschke, F., & Pauly, M. (2019). Resampling-based analysis of multivariate data and repeated measures designs with the R package MANOVA.RM. R J., 11(2), 380.

      2. The authors describe session 1 as characterized by 'overgeneralization' due to increased reward latencies. I do not follow this logic. Generalization typically refers to a situation in which a response to one action or cue extends to a second, similar action or cue. In the authors' design, there is only one cue and one action. I do not see how generalization is relevant here.

      This wording has been changed to “non-specific” in the results and discussion.

      3. The authors consistently report dmPFC and VTA 'neural activity'. The authors did not record neural activity. The authors recorded changes in fluorescence due to calcium influx into neurons. Even if these changes have similar properties to neural activity measured with single-unit recording, the authors did not record neural activity in this manuscript.

      We do not imply that we recorded unit activity in these studies and state in the introduction that fiber photometry is an indirect measure of neural activity. We have also reworded much of the text in the manuscript to use “calcium activity.”

      4. Comparing the patterns in Figures 2 and 3, it appears that the dmPFC change in fluorescence was similar in non-shocked and shock trials up until shock delivery. However, the VTA patterns differ. No cue differences were observed for sessions 1-3 on shock trials (Figure 3A), yet differences were observed on non-shocked trials (Figure 2F). Further, changes in fluorescence between sessions 1 and 2/3 appeared to emerge just as foot shock would have been delivered. A split should be evident in Figure 3B, but it is not. Were these differences caused by sampling issues due to foot shock trials being rarer?

      We agree, although some of this could be because footshock trials were collapsed across blocks 2 and 3 (as no differences in shock response were observed between blocks). Nevertheless, we have added a caveat about cue responses to the results (see bottom of page 11 and top of page 15). Regarding the lack of a split in Figure 3A, this difference may be due to shock onset time. The permutation tests indicate that the differences in action activity in Figure 2B emerge about 0.5-0.8 seconds after the action, which is when the shock begins. It is therefore not surprising that the results in Figure 2F do not match well with those in Figure 3A, given the rapid and robust response to the footshock.

      5. Similar to Figure 1, I could not follow the ANOVA results for the effects of diazepam treatment on trials completed, action latency, and reward latency (Figure 4). Relatedly, from what session do the bar plot data in Figure 4B come? Is it the average of the 6% and 10% blocks? I cannot tell.

      Please see our response to comment 1 for the analysis relevant to this comment. Yes, the risk-block average is the average of the 6% and 10% blocks. What the bar plot data represent is now explained more clearly in the methods.

      6. For the diazepam experiment, did all rats receive saline and diazepam injections in separate sessions? If so, were these sessions counterbalanced? And further, how did the session numbers relate to sessions 1-3 of the first study? All of these details are extremely relevant to interpreting the results and comparing them to the first study, as session # appeared to be an important factor. For example, the decrease in dmPFC fluorescence to reward during the No-Risk block appeared to better match the fluorescent pattern seen in sessions 1 and 2 of the first experiment. In that case, the saline vs. diazepam difference was due to saline rats not showing the expected pattern of fluorescence.

      Subjects received saline and diazepam in separate sessions. Furthermore, diazepam was not tested until animals had at least 3 sessions of training (range of sessions 4-8). Clarification has been added to the methods.

      The new AUC analysis for Reviewer 1 allows better assessment of potential differences between earlier sessions and saline (see Figure 7-figure supplements 2 and 3). We also found the effect of diazepam on reward responses perplexing and somewhat modest. However, even when comparing only Saline and Session 3 PFC AUC data, we found no significant effect of session or session*risk interaction for action or reward (F values < 1.3, p values > .27).

      7. The authors seem convinced that fiber photometry is a surrogate for neural activity. Although significant correlation coefficients are found during action and reward, these values hover around 0.6 for the dmPFC and 0.75 for the VTA. Further, no correlations are observed for cue periods. A strength of the calcium imaging approach is that it permits the monitoring of specific neural populations. This would have been very valuable for the VTA, in which dopamine and GABA neurons must show very different patterns of activity. Opting for fiber photometry and then using a pan-neuronal approach fails to leverage the strength of the approach.

      The parent paper (Park & Moghaddam, 2017) used unit recording in this task (including reporting data from dopamine and non-dopamine VTA units). We assure the reviewer that we do not claim that fiber photometry is a perfect surrogate for direct recording of neural activity. However, a key question we wanted to answer in this study was whether the response of PFC and VTA to the footshock changes during task acquisition (please see last paragraph of introduction), hence the choice to use fiber photometry. We note in the results and discussion that this approach is not optimal for detecting cue or other rapid responses (see page 15 and 23).

      Reviewer 3 (Public Review):

      Probably the biggest overall issue is that it is unclear what specifically is being learned. There is no probe test at the end to dissociate the direct impact of shock from its learned impact, and the blocks are not signaled in any other way. Although there seems to be some evidence that the shock effects become more pronounced across sessions, it is not clear that the rats are really learning to associate specific shock risks with particular trials. Indeed, with so few sessions and so few actual shocks, this seems really unlikely, especially since, without an independent cue, the shock and its frequency are the cue for the block switch. It seems especially unlikely that there is a strong dichotomy between the 6% and 10% blocks in the rats' model of the environment. This may be quite relevant for understanding foraging under risk, but I think it means that some of the language in the paper about contingencies and the like should be avoided.

      Although the parent paper (Park & Moghaddam, 2017) examined this question in more depth, we agree that what exactly is learned may be difficult to ascertain. To address this (please also see our response to Reviewer #1’s first comment), we have toned down our use of the term “contingency learning” throughout the manuscript and now use the word contingency in reference to the underlying reinforcement/punishment schedules.

      The second issue is that I had some trouble lining up the claims in the results with what appeared to be meaningful differences in the figures. Just looking at them, it seems to me that the VTA shows higher activity at higher shock risk, particularly at the time of reward but also when comparing safe vs. risky blocks for the cue and action periods. The dmPFC shows a similar pattern in the reward period. […] But these results are not described this way at all. The focus is on the action period only, and on ramping? I don't really see ramping. The text says "Anxiogenic contingencies also did not influence the phasic response to reward...", but Fig. 3 seems to clearly show different reward responses. The characterization of the change is particularly important since, to me, it looks like diazepam essentially normalizes these features of the response. This makes sense to me […].

      We initially believed that much of the difference in reward responses (with the exception of Session 2 in the PFC) was carryover from differences in the peri-action period. However, upon quantifying these responses again using AUC change scores to adjust for pre-event differences in the signal, we observed small reward-related increases (data are in Figure 7 – supplements 2/3) and have updated the Results and Discussion accordingly.

      Although some lessening of the reward response may be apparent across the diazepam session in the VTA (Figure 7 – supplement 2/3G), we do not have statistical support for this: no significant differences were observed in permutation comparisons to saline, and only Session 3 deviated from the first session for the reward period in the AUC analyses.
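      The idea of an AUC change score that "adjusts for pre-event differences in the signal" can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the analysis code used in the study: it assumes the change score is the trapezoidal area under the ΔF/F trace in a post-event window minus the area over an equal-length pre-event window, and the function names, window length, and 10 Hz sampling rate are hypothetical.

```python
def trapezoid_auc(samples, dt):
    """Trapezoidal area under a uniformly sampled trace; dt is the sample spacing (s)."""
    return sum((a + b) / 2 * dt for a, b in zip(samples, samples[1:]))


def auc_change_score(trace, event_idx, window, dt):
    """Post-event AUC minus pre-event AUC over equal-length windows.

    Hypothetical definition for illustration: both windows span `window`
    samples and share the sample at `event_idx`.
    """
    if event_idx - window < 0 or event_idx + window >= len(trace):
        raise ValueError("analysis windows extend past the ends of the trace")
    pre = trace[event_idx - window : event_idx + 1]
    post = trace[event_idx : event_idx + window + 1]
    return trapezoid_auc(post, dt) - trapezoid_auc(pre, dt)


# Hypothetical dF/F traces sampled at 10 Hz (dt = 0.1 s), event at index 5.
carryover = [1.0] * 11          # already elevated before the event (carryover)
step = [0.0] * 5 + [1.0] * 6    # rises at the event
print(auc_change_score(carryover, 5, 5, 0.1))
print(auc_change_score(step, 5, 5, 0.1))
```

      The first trace scores zero even though its signal is elevated throughout, because the elevation is already present before the event; the second scores roughly 0.45. Subtracting an equal-length baseline is one simple way to separate an event response from carryover of an earlier response.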

    1. Author Response:

      Reviewer #1:

      General overview and merit of academic rigor:

      Xu et al. put forth an innovative experimental pipeline to examine the connections of the raphe nuclei. This manuscript details elegantly designed viral tract-tracing methods coupled with fMOST intact-brain imaging and sophisticated analyses. All figures are of good quality. The studies presented in the current manuscript will be a valuable contribution to the field, and I therefore enthusiastically recommend publication. However, a number of revisions and clarifications are warranted before publication.

      Major concerns:

      1. The manuscript's English needs to be proofread extensively for readability and clarity.

      We asked two native English-speaking experts to proofread and revise the entire manuscript.

      1. The term MR (median raphe) is used in the atlas of Paxinos and Franklin. However, the entire study follows the Allen Reference Atlas nomenclature, in which the same raphe nucleus is called the "Superior central nucleus raphe" (CS). For consistency, I suggest using "CS" instead of "MR". Alternatively, please state clearly in the Introduction that the MR is equivalent to the CS in the Allen Reference Atlas.

      As suggested, we added a statement that the MR is equivalent to the CS in the Allen CCFv3 (lines 15-18).

      “The dorsal raphe nucleus (DR) and median raphe nucleus (MR, equivalent to the superior central nucleus raphe in the Allen Mouse Brain Common Coordinate Framework version 3 (Allen CCFv3)) belong to the rostral group of the raphe nuclei and contain most of the brain’s serotonergic neurons (Wang et al., 2020; Watson et al., 2012).”

      1. In the Introduction, the rationale for selectively studying the DR and MR is unclear (why are the other raphe nuclei not included?).

      We have revised the Introduction to explain why we selectively studied the DR and MR (lines 15-25).

      “The dorsal raphe nucleus (DR) and median raphe nucleus (MR, equivalent to the superior central nucleus raphe in the Allen Mouse Brain Common Coordinate Framework version 3 (Allen CCFv3)) belong to the rostral group of the raphe nuclei and contain most of the brain’s serotonergic neurons (Wang et al., 2020; Watson et al., 2012). The DR and MR are involved in a multitude of functions (Domonkos et al., 2016; Huang et al., 2019; Szőnyi et al., 2019); moreover, they have different, and even antagonistic, roles in the regulation of specific functions, including emotional behavior, social behavior, and aggression (Balázsfi et al., 2018; Ohmura et al., 2020; Teissier et al., 2015). These diverse regulatory processes are related to the connectivity of heterogeneous raphe groups (Muzerelle et al., 2016; Nectow et al., 2017; Schneeberger et al., 2019). Deciphering the precise input and output organization of different neuron types in the DR and MR is fundamental for understanding their specific functions.”

      1. In the Results, I did not find any figure panel or images to show the anatomical location of the MR. Figure 1 shows only one injection site in DR. It is necessary to also show at least one representative injection site in the MR.

      As suggested, we added more information on the injection sites in Figure 1—figure supplement 2 and Figure 4—figure supplement 1.

      Figure 1—figure supplement 2. Validation of the labeling of whole-brain inputs. (A) Representative coronal section of the injection site showing the starter cells (cyan). The image is from a representative sample labeling the inputs to MR Gad2+ neurons. Scale bar, 1 mm. (B) Enlarged view of the dotted box area in (A). Scale bar, 100 μm. (C) The number and on-target rate of labeled starter cells, and the ratio of input neurons to starter cells. The data are from validation samples labeling the inputs to MR Gad2+ neurons. Data are shown as mean ± s.e.m., n = 3. (D) Comparison of inputs to MR Gad2+ neurons between the validation and experimental datasets.

      Figure 4—figure supplement 1. Validation of the injection sites of whole-brain outputs. (A) Representative coronal section of the injection site in a representative sample labeling the outputs of MR Vglut2+ neurons. The dataset has been registered to the Allen CCFv3. White dotted lines, MR in the Allen CCFv3; yellow lines, segmented injection site. (B) Representative coronal sections of the injection site of the sample in (A). (C) Proportion of the injection-site signal located within the DR/MR. Data are shown as mean ± s.e.m., n = 4 per group. Scale bars: A, 1 mm; B, 500 μm.

      1. This study is designed to map the inputs/outputs of two major populations of neurons (Glu+ and GABAergic) in the DR and MR using two Cre-driver lines (Vglut2-Cre and Gad2-Cre). Please clarify how these two Cre lines were characterized and whether their Cre expression is consistent with endogenous gene expression. What are their distribution patterns in the DR and MR? Are they intermingled or relatively segregated? How do their distributions compare with that of serotonergic neurons?

      The Vglut2-Cre and Gad2-Cre mice were purchased from the Jackson Laboratory and genotyped according to the supplier's instructions. To characterize the expression and distribution patterns of Vglut2+ and Gad2+ neurons, we crossed each Cre driver line with a reporter line (Figure 1—figure supplement 1). In the DR, Vglut2+ neurons were mostly found in the rostral part, while Gad2+ neurons were widely distributed and densely clustered in the lateral DR. In the MR, Vglut2+ neurons were mainly found in the caudal part, and those in the rostral MR were mainly distributed laterally, whereas Gad2+ neurons were distributed throughout the MR. Thus, the overall distribution patterns of Vglut2+ and Gad2+ neurons within the same raphe nucleus differ clearly. Compared with the distribution of serotonergic neurons (http://connectivity.brain-map.org/transgenic/experiment/100140881), Vglut2+ neurons appear relatively segregated from them, whereas Gad2+ neurons are intermingled with them.

      Because Gad2-Cre generally labels all mature GABAergic neurons, whereas Vglut2-Cre labels only a subpopulation of glutamatergic neurons (the DR and MR also contain numerous Vglut3+ neurons), we performed additional experiments to characterize the specificity of the labeled Vglut2+ starter cells. In situ hybridization showed that the labeled starter cells in Vglut2-Cre mice were Vglut2 positive, with a few also being Vglut3 positive (Figure 1B,C; Figure 1—figure supplement 3), which was confirmed by immunohistochemical staining (Figure 1—figure supplement 4).

      Figure 1—figure supplement 1. Distribution and total number of Vglut2+ and Gad2+ neurons in the DR and MR. (A) Representative maximum intensity projections of coronal sections showing the distribution of Vglut2+ and Gad2+ neurons in the DR. The projections are 200 μm thick. Scale bar, 200 μm. The total numbers of Vglut2+ and Gad2+ neurons in the DR are presented as mean ± s.e.m., n = 2. (B) Representative maximum intensity projections of coronal sections showing the distribution of Vglut2+ and Gad2+ neurons in the MR. The projections are 200 μm thick. Scale bar, 200 μm. The total numbers of Vglut2+ and Gad2+ neurons in the MR are presented as mean ± s.e.m., n = 2. (C) Density plot of specific neuron types in the DR and MR along the anterior-posterior axis. Bin width, 100 μm. The shaded area indicates s.e.m., n = 2.

      Figure 1—figure supplement 3. Characterization of the specificity of starter cells using in situ hybridization. (A) In situ hybridization at the MR in a Vglut2-Cre mouse. (B) Enlarged view of the box area in (A). White arrows, starter cells. Scale bars: A, 200 μm; B, 20 μm.

      Figure 1—figure supplement 4. Validation of the specificity of starter cells using immunohistochemical staining. (A) Immunohistochemical staining against Vglut3 at the DR in a Vglut2-Cre mouse. White arrows, starter cells. Red arrows, starter cells that are Vglut3 positive. (B) Immunohistochemical staining against Vglut3 at the MR in a Vglut2-Cre mouse. White arrows, starter cells. Red arrows, starter cells that are Vglut3 positive. (C) Control experiment, immunohistochemical staining against Vglut3 at the SSp in a Vglut2-Cre mouse. Scale bar, 50 μm.

      1. Overall, the Discussion is not well organized. I suggest starting with a clear statement of the novel discoveries of this study in comparison with the existing literature, and then comparing the overall input/output patterns of the Glu+ and GABAergic populations in the DR and MR. The current Discussion focuses on a few major targets (i.e., CEA, LH) but misses the big picture. Additionally, it is necessary and important to carefully compare their connectivity patterns with those of serotonergic neurons in these two raphe nuclei.

      As suggested, we have reorganized the Discussion. First, we compare the results with the existing literature and point out the similarities and differences between these connectivity patterns and those of serotonergic neurons. Then, we compare the overall input/output patterns of glutamatergic and GABAergic neurons in the DR and MR and discuss their implications for behavioral functions. Finally, we discuss the potential caveats of our viral tracing techniques and data analysis.

      Minor concerns:

      1. The Impact statement reads, "We reconstructed the input-output circuits of glutamatergic and GABAergic neurons in the dorsal raphe nucleus and median raphe nucleus and proposed a more refined model of the habenula-raphe circuit." When such a comparison is made, it should specify what the model is more refined than. This is well explained in lines 242 and 243: "Based on the conventional model of the habenula-raphe circuit (Hikosaka, 2010; Hu et al., 2020), we proposed a more refined model of the habenula-raphe circuit (Figure 5C)." Please make a similar statement earlier, in the Impact statement.

      We have revised the Impact statement as follows:

      “Whole-brain quantitative input-output circuits of glutamatergic and GABAergic neurons in the mouse dorsal and median raphe nuclei were mapped using viral tracing and high-resolution optical imaging.”

      1. For Figure 2A, it would be easier on the reader if inputs for each region (DR and MR) and each plane of the section were placed on the same image akin to the inputs presented on coronal maps in Figures 2B and 5A and the inputs/outputs for each region (MR and DR) in the sagittal summary diagrams in Figure 7.

      For Figures 2A and 2B, we wanted to present the whole-brain inputs from different perspectives. For Figure 2A, because there were tens of thousands of input neurons and the input patterns were similar, placing all inputs on the same image would cause the colors to mix and make the patterns difficult to see. Thus, we presented them separately in sagittal and horizontal views in Figure 2A, and presented the inputs together on coronal maps in Figure 2B.

      1. It is unclear what the nonsignificant grey open circles represent in Figures 3A-D; 4D and E.

      In Figures 3A-D, the circles represent the proportion of input neurons in each brain region. If the inputs from a brain region to the two neuron groups differ significantly, the circle is red and solid, and the name of the brain region is shown nearby. If there is no significant difference, the circle is gray and hollow, and, to highlight the brain areas with significant differences, the region's name is not shown. The source data are provided in Supplementary File 2. Figures 4D and E follow the same convention as Figures 3A-D.

      1. In Figure 4A, the imaging portion would be clearer if it read "optical sectioning."

      As suggested, we revised the imaging portion of Figure 4A to make it clearer.

      1. In Figure 7A and B, the position of ACA on the flatmap looks odd to me (it is a little bit too caudal).

      As suggested, we revised the position of ACA on the flatmap.

      Reviewer #2:

      This work from Xu et al., "Whole-brain connectivity atlas of glutamatergic and GABAergic neurons in mouse dorsal and median raphe nucleus", provides a comprehensive brain-wide analysis of input and output patterns to/from specific DR/MR neuronal populations in the adult mouse brain. With the exceptional strength in high-quality whole-brain imaging for which this group is known, their data and analyses are thorough and convincingly support the manuscript's general conclusion describing both convergent and divergent patterns of DR/MR connectivity. While the current study is based on structural rather than functional correlation analysis, the results are validated against prior knowledge in the field. It will provide a more complete picture to facilitate future investigation of DR/MR connectivity and physiological functions.

      The work would provide significant and useful knowledge to the field, while also promoting the generation and application of advanced brain-wide profiling resources to advance broad neuroscience research topics. However, there are still a few technical and analytical concerns that need to be addressed or discussed to refine the conclusions.

      Major concerns:

      1. For targeted injection-based analysis, it is critical to carefully analyze and discuss the on-target vs. off-target rates of labeled cells in the DR/MR to validate the datasets. Whole-mount data are ideally suited for such an accurate analysis, which was not possible before.

      For the inputs, samples from the same batch of viral tracing experiments were used as validation datasets to analyze the on-target rate of labeled starter cells. For samples labeling inputs to MR Gad2+ neurons, the on-target rate of labeled starter cells is 66.40 ± 2.78% (Figure 1—figure supplement 2C). We also counted the input neurons and calculated the ratio of input neurons to starter cells (Figure 1—figure supplement 2C). The validation samples showed input patterns consistent with those of the experimental datasets (Figure 1—figure supplement 2D). For the outputs, we manually segmented the injection region and calculated the proportion of its signal located within the DR/MR (Figure 4—figure supplement 1).

      1. It is also important to know what percentage of cells is labeled in individual samples, how many samples were used, and what total coverage/saturation of the entire anatomical structure was achieved, to justify a complete/comprehensive analysis.

      We counted the Vglut2+ and Gad2+ neurons in the DR and MR in crossed mice (Vglut2-Cre: LSL-H2B-GFP mice and Gad2-Cre: LSL-H2B-GFP mice; Figure 1—figure supplement 1A,B). For the inputs to MR Gad2+ neurons, the labeling rate is 11.60 ± 1.28% (Figure 1—figure supplement 2C). For the outputs, we counted the labeled Vglut2+ and Gad2+ neurons in the DR and MR and calculated the percentage of each population that was labeled (DR Vglut2+: 18.38 ± 8.33%, DR Gad2+: 10.66 ± 2.65%, MR Vglut2+: 43.67 ± 8.25%, MR Gad2+: 11.10 ± 2.09%) (Supplementary File 3). The data were replicated in 4 samples, which is comparable to previous studies of input and output circuits (Ährlund-Richter et al. 2019. Nature Neuroscience, 22: 657–668; Do et al. 2016. eLife, 5: e13214; Gehrlach et al. 2020. eLife, 9: e55585).
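      The rates reported above are quotients of cell counts summarized as mean ± s.e.m. across samples. The sketch below makes that arithmetic explicit under assumed definitions (on-target rate = starter cells inside the targeted nucleus / all starter cells; labeling rate = labeled Cre+ cells / all Cre+ cells counted in the reporter cross; input-to-starter ratio = input neurons / starter cells). The function names and counts are hypothetical and are not the authors' pipeline or data.

```python
import math
import statistics


def on_target_rate(starters_in_target, starters_total):
    """Fraction of starter cells that lie inside the targeted nucleus."""
    return starters_in_target / starters_total


def labeling_rate(labeled_cre_cells, total_cre_cells):
    """Fraction of the Cre+ population in the nucleus that was labeled."""
    return labeled_cre_cells / total_cre_cells


def input_to_starter_ratio(input_neurons, starter_cells):
    """Number of input neurons per starter cell."""
    return input_neurons / starter_cells


def mean_sem(values):
    """Mean and standard error of the mean (sample SD divided by sqrt(n))."""
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(len(values))


# Hypothetical counts for three validation brains (illustrative only).
samples = [
    {"in_target": 130, "starters": 200, "cre_total": 1500, "inputs": 9000},
    {"in_target": 140, "starters": 200, "cre_total": 1600, "inputs": 8000},
    {"in_target": 120, "starters": 200, "cre_total": 1400, "inputs": 10000},
]
rates = [on_target_rate(s["in_target"], s["starters"]) for s in samples]
labeling = [labeling_rate(s["starters"], s["cre_total"]) for s in samples]
ratios = [input_to_starter_ratio(s["inputs"], s["starters"]) for s in samples]

for name, values in [("on-target rate", rates), ("labeling rate", labeling)]:
    m, s = mean_sem(values)
    print(f"{name}: {100 * m:.2f} ± {100 * s:.2f}%")
m, s = mean_sem(ratios)
print(f"inputs per starter cell: {m:.1f} ± {s:.1f}")
```

      Keeping the three quotients separate makes clear that a low labeling rate (few starter cells relative to the whole population) and a high on-target rate (few starter cells outside the nucleus) are independent properties of a sample.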

      1. Further to the last point, the labeling rates need to be small enough to warrant a more meaningful analysis in Figure 6. From another angle, is there any anatomical correlation between the target sites in the DR/MR and the distinct input/output clusters? This can probably be best addressed with single-neuron-resolution analysis, at which this group excels. For the current study, it is vital to include this detailed information to provide a better resource to the field (e.g., to guide or map to future spatial transcriptomic analyses of molecular-cellular correlations).

      As noted in our response to the previous question, the labeling rate is low, which should help ensure that the analysis is meaningful. The analysis in Figure 6 implies that the glutamatergic and GABAergic neurons in the DR/MR might receive inputs from and project to various combinations of brain regions. Brain regions in one cluster might be connected with the same subsets of specific neuron types, whereas negatively correlated brain regions might be connected with distinct subsets of specific neuron types (Weissbourd et al. 2014. Neuron 83: 645–662). For example, among the inputs to DR Vglut2+ neurons, the Vglut2+ neurons receiving inputs from the SNc might be the same as those receiving inputs from the VTA and SNr, but distinct from those receiving inputs from the BST (Figure 6A). These implications are worth testing through complete single-neuron reconstruction. However, single-neuron reconstruction requires substantial time, which is beyond the scope of this work but is in our future plans. In addition, our datasets have been registered to the Allen CCFv3, so they can be directly integrated with other resources that use the same coordinate system. Spatial transcriptomics is currently a very active research area, but its spatial localization does not yet reach single-neuron resolution, and it is difficult to integrate with morphology. We are working in this direction but have not yet made significant progress.
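      The logic described above, that regions whose input proportions rise and fall together across animals may contact the same starter-cell subsets while negatively correlated regions may contact distinct subsets, can be illustrated with a small correlation sketch. It assumes Pearson correlation of per-region input proportions across samples; the exact statistic and clustering used for Figure 6 are not specified here, and the region values below are invented for illustration.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sx * sy)


def region_correlations(proportions):
    """Pairwise correlations of per-region input proportions across samples.

    `proportions` maps region name -> list of input proportions, one per
    sample, with samples in the same order for every region.
    """
    return {
        (r1, r2): pearson(proportions[r1], proportions[r2])
        for r1, r2 in combinations(sorted(proportions), 2)
    }


# Hypothetical input proportions from four brains (illustrative values only).
props = {
    "SNc": [0.05, 0.07, 0.06, 0.08],
    "VTA": [0.10, 0.14, 0.12, 0.16],
    "BST": [0.09, 0.05, 0.07, 0.04],
}
for pair, r in region_correlations(props).items():
    print(pair, round(r, 2))
```

      With these made-up values, SNc and VTA are positively correlated and BST is negatively correlated with both; a clustering step on such a matrix would group SNc with VTA. With only a handful of samples, such correlations are suggestive rather than conclusive, which is why single-neuron reconstruction is the more direct test.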

      Reviewer #3:

      Xu et al. utilize retrograde and anterograde viral tracing in Cre-transgenic mouse lines to map the inputs and outputs of glutamatergic and GABAergic neuronal populations in the dorsal (DR) and median raphe (MR) nucleus. The experiments generate a large anatomical dataset, which the authors analyse with correlation analysis, revealing subtle differences in connectivity patterns between the targeted cell types and nuclei. The study furthermore focuses on the lateral habenula (LH) to raphe nucleus circuit, identifying abundant inputs from the LH to both glutamatergic and GABAergic DR and MR populations, but scarce projections from these cells back to the LH, with some cell-type-specific differences. In particular, MR glutamatergic neurons send the strongest projections to the LH among the targeted populations, supporting previous studies that identified this pathway as playing a role in aversive behaviors.

      Overall, this study nicely complements previously published work on the whole-brain connectivity of the DR and MR, which has chiefly focused on the main neuromodulatory neurons found in these nuclei, i.e., serotonin and dopamine neurons. Some of the experiments in the study are not completely novel, such as input tracing to GAD2-expressing neurons in the DR (Weissbourd et al., 2014). However, a comprehensive side-by-side comparison of glutamatergic and GABAergic connectivity in both the DR and MR has not been performed before, and will provide a welcome resource to circuit neuroscientists looking to elucidate functional circuits of the raphe nuclei. A further strength of the study is the high-resolution 3D imaging, revealing three distinct projection pathways from MR glutamatergic neurons to the LH.

      Two main concerns regarding the study are:

      1) The authors do not sufficiently justify the use of Vglut2 as a marker for glutamatergic neurons in DR and MR. The majority of previous studies, especially of the DR, use another glutamatergic marker which is more specifically expressed in the raphe nuclei, namely Vglut3. Vglut3 is much more anatomically restricted to the DR and MR (but has also been shown to partially overlap with serotonergic expression). In contrast, Vglut2 is very broadly expressed throughout the brain and in regions adjacent to DR and MR. For this reason, and from the data in the main manuscript as well as raw microscopy images provided in the accompanying website, it is unclear how specific the starter neuron targeting really is. The authors should show more detailed starter neuron analysis for both the broadly expressed Vglut2 and Gad2 in the DR and MR, showing the histology of the helper virus BFP and RV-ΔG-EnvA-GFP, their anatomical locations, and some quantification of proportion of starter cells within DR/MR (Fig 1B-C shows it only for Vglut2, but in insufficient detail). Furthermore, a rationale for using Vglut2 instead of Vglut3 would be appreciated, especially given that the vast majority of functional studies of the DR have used Vglut3.

      The authors also miss the chance to characterize the topography of Vglut2 and Gad2 starter cell expression within the DR and MR and emphasize the interesting differences between these two populations, which may be relevant to the differences in input and output connectivity.

      We added more information on starter cells in Figure 1—figure supplement 2. We also performed in situ hybridization and immunohistochemical staining to characterize the specificity of the labeled Vglut2+ starter cells. The labeled starter cells were Vglut2 positive, and a fraction of them were also Vglut3 positive (Figure 1B, C; Figure 1—figure supplements 3 and 4).

      Glutamatergic neurons in the DR and MR mainly comprise Vglut2+ and Vglut3+ neurons, but numerous Vglut3+ neurons are also serotonergic (Huang et al. 2019. eLife, 8: e46464; Pinto et al. 2019. Nature Communications 10, 4633; Sos et al. 2017. Brain Structure and Function, 222: 287–299). The anatomical connections of serotonergic neurons in the DR and MR have been well studied (Pollak Dorocic et al. 2014. Neuron 83: 663–678; Ren et al. 2019. eLife 8: e49424; Weissbourd et al. 2014. Neuron 83: 645–662). DR and MR Vglut2+ neurons are largely distinct from Vglut3+ neurons and have been shown to regulate many functions, such as emotional behaviors (Szőnyi et al. 2019. Science 366: eaay8746), but their whole-brain connectivity remains incompletely characterized. Thus, we chose to study the inputs and outputs of Vglut2+ neurons.

      There are also differences between the distributions of Vglut2+ and Gad2+ neurons in the DR/MR (Figure 1—figure supplement 1). These differences might be relevant to the differences in input and output connectivity, and we plan to investigate them in future studies.

      2) The quantification throughout the manuscript refers to the relative proportion of inputs or outputs for each cell population and nucleus. The manuscript would be strengthened by also including total cell counts for starter cells in each group, as well as total numbers of input neurons. For example, is the Vglut2 population in the DR much larger than the Gad2 population, and do DR Vglut2 neurons receive more inputs in total than DR Gad2 neurons? Including raw numbers would provide concrete information to contextualize connectivity patterns between cell types and nuclei for readers.

      We added the numbers of input neurons to Supplementary File 2. As discussed in lines 413-415, the monosynaptic rabies tracing technique might label only a portion of inputs, and the labeling could be biased toward specific neuron types and affected by many factors. Furthermore, the ratio of input neurons to starter cells varies over a wide range (Callaway and Luo. 2015. The Journal of Neuroscience, 35: 8979–8985). Thus, a larger population of a given neuron type does not necessarily indicate that it receives more inputs.

    1. Author Response

      Reviewer #2 (Public Review):

      The manuscript by Ma et al., "Two RNA-binding proteins mediate the sorting of miR223 from mitochondria into exosomes", examines the contribution of two RNA-binding proteins to the exosomal loading of miR223. The authors conclude that YBX1 and YBAP1 work in tandem to traffic and load miR223 into the exosome. The manuscript is interesting and potentially impactful. It proposes the following scenario for the exosomal loading of miR223: (1) YBAP1 sequesters miR223 in the mitochondria, (2) YBAP1 then transfers miR223 to YBX1, and (3) YBX1 then delivers miR223 into the early endosome for eventual secretion within an exosome. While the authors propose plausible explanations for this phenomenon, they do not test them directly, and no mechanism is shown by which miR223 is shuttled from YBAP1 to YBX1 and into the exosome. Thus, the paper is missing critical mechanistic experiments that could have readily tested its speculative conclusions.

      Comments:

      1) The major limitation of this paper is that it fails to explore the mechanism of any of the major changes it describes. For example, the authors propose that miR223 shuttles from mitochondrially localized YBAP1 to P-body-associated YBX1 to the exosome. This needs to be tested directly and could be easily addressed by showing a transfer of miR223 from YBAP1 to YBX1 to the exosome.

      Testing this idea using fluorescently labeled miR223 would indeed be an ideal experiment. However, miRNA imaging presents challenges. As Reviewer 1 pointed out, and as we have now confirmed, the atto-647 dye itself localizes to mitochondria. We will continue our efforts to identify a suitable fluorescent label for miR223 in order to evaluate the temporal relationship between mitochondrial and endosomal miR223.

      2) If YBAP1 retains miR223 in mitochondria, what is the trigger for YBAP1 to release it and pass it off to YBX1? The authors speculate in their discussion that sequestration of mito-miR223 plays a "role in some structural or regulatory process, perhaps essential for mitochondrial homeostasis, controlled by the selective extraction of unwanted miRNA into RNA granules and further by secretion in exosomes...". This is readily testable by altering mitochondria dynamics and/or integrity.

      A previous study has reported that YBAP1 can be released from mitochondria to the cytosol during HSV-1 infection (Song et al., 2021). However, due to restrictions, we are unable to conduct experiments using HSV to verify this condition. We attempted to induce mitochondrial stress using different concentrations of CCCP, but we did not observe the release of YBAP1 from mitochondria after CCCP treatment. We speculate that not all mitochondrial stress conditions can trigger YBAP1 release. The mechanism of mito-miR223 release from mitochondria is a question we aim to explore in future studies.

      3) Much of the miRNA RT-PCR analysis is presented as a ratio of exosomal/cellular. This particular analysis assumes that cellular miRNA is unaffected by treatments. For example, Figure 1a shows that the presence of exosomal miR223 is significantly reduced when YBX1 is knocked out. This analysis does not consider the possibility that YBX1-KO alters (up or down-regulates) intracellular miR223 levels. Should that be the case, the ratiometric analysis is greatly skewed by intracellular miRNA changes. It would be better to not only show the intracellular levels of the miRs but also normalize the miRNA levels to the total amount of RNA isolated or an irrelevant/unchanged miRNA.

      Our previous publications demonstrated that miR223 levels are increased in YBX1-KO cells and decreased in exosomes derived from YBX1-KO cells, whereas no significant changes were observed in miR190 levels (Liu et al., 2021; Shurtleff et al., 2016). Data from the repeated experiment have been included in Figure 1a.

      For the other miRNAs analyzed by RT-PCR, we report changes in both intracellular and exosomal miRNA levels in the corresponding figures. In the qPCR analysis, miRNA levels were normalized to the total amount of RNA.
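      The reviewer's concern about ratiometric analysis can be made concrete with invented numbers. The sketch below assumes a scenario in which the knockout leaves exosomal miR223 unchanged but doubles cellular miR223; it compares the fold change of the exosome/cell ratio with the fold change of exosomal miR223 normalized to a reference miRNA assumed to be unaffected by the knockout. All values and the function name are hypothetical.

```python
def fold_changes(wt, ko):
    """Compare KO vs. WT using two normalizations of exosomal miR223.

    `wt` and `ko` are dicts of hypothetical relative levels:
      "exo"     - miR223 in exosomes
      "cell"    - miR223 in cells
      "exo_ref" - a reference miRNA in exosomes, assumed unaffected by the KO
    Returns (fold change of the exosome/cell ratio,
             fold change of exosomal miR223 normalized to the reference).
    """
    ratio_fc = (ko["exo"] / ko["cell"]) / (wt["exo"] / wt["cell"])
    ref_fc = (ko["exo"] / ko["exo_ref"]) / (wt["exo"] / wt["exo_ref"])
    return ratio_fc, ref_fc


# Hypothetical scenario: secretion unchanged, cellular miR223 doubles in the KO.
wt = {"exo": 1.0, "cell": 1.0, "exo_ref": 1.0}
ko = {"exo": 1.0, "cell": 2.0, "exo_ref": 1.0}
print(fold_changes(wt, ko))  # (0.5, 1.0)
```

      In this invented case the exosome/cell ratio halves even though exosomal miR223 has not changed, which is why reporting intracellular and exosomal levels separately, each against a stable normalizer, avoids misattributing a change in cellular abundance to a change in sorting.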

      4) In Figure 1, the authors show that in YBX1-KO cells, miR223 levels are decreased in the exosome. They further suggest this is because YBX1 binds with high affinity to miR223. This binding is compared with that of miR190, which, the authors state, is not enriched in the exosome. However, no data showing that miR190 is absent from the exosome are shown. The amounts of cellular and exosomal miR223 and miR190 should be shown together on the same graph.

      In previous publications, we demonstrated that miR190 is not localized in exosomes and is not significantly changed in YBX1 knockout (KO) cells or in exosomes derived from YBX1 KO cells (Liu et al., 2021; Shurtleff et al., 2016). Data from the repeated experiment have been included in Figure 1a.

      5) Figure 2 Supplement 1 - To determine the nucleotides responsible for interacting with YBX1, the authors made several mutations within the miR223 sequence. However, no explanation is given of the mutant sequences used or of what the ratios mean. The mutant sequences need to be included. How do the authors conclude that UCAGU is important when the locations of the mutations are unclear? Also, the interpretation of these data would benefit from a binding affinity curve such as that shown in Fig 2C.

      The ratio is that of labeled miR223 to unlabeled miR223 (WT or mutant). All mutant sequences of miR223 have been included in Figure 2 Supplement 1.

      6) While the binding of miR223mut to YBX1 is reduced, there is still significant binding. Does this mean that the 5nt binding motif is not exact? Do the authors know if there are multiple nucleotide possibilities at these positions that could facilitate binding? Perhaps confirming binding "in vivo" via RIP assay would further solidify the UCAGU motif as critical for binding to YBX1.

      The binding affinity of miR223mut with YBX1 is reduced approximately 27-fold compared to miR223. We speculate that the secondary structure of miR223 may contribute to the interaction with YBX1.

      Our EMSA data, in vitro packaging data, and exosome analysis reinforce the conclusion that UCAGU is critical for YBX1 binding. These findings suggest that the presence of the UCAGU motif in miR223 is crucial for its interaction with YBX1 and subsequent sorting into exosomes.

      7) Figures 2g, h - It would be nice to show that miR190mut also packages in the cell-free system. This would confirm that the sequence is responsible. Also, to confirm that the sorting of miR223 is YBX1-dependent, a cell-free reaction using cytosol and membranes from YBX1 KO cells is needed.

      Although we have not performed the suggested experiment, we purified exosomes from cells overexpressing miR190sort and observed an increase in the enrichment of miR190sort in exosomes compared to miR190. This finding confirmed that the UCAGU motif facilitates miRNA sorting into exosomes.

      Regarding the in vitro packaging assay, our previously published paper demonstrated that cytosol from YBX1 knockout (KO) cells significantly reduces the protection of miR223 from RNase digestion. We concluded that the sorting of miR223 into exosomes is dependent on YBX1 (Shurtleff et al., 2016).

      8) In Figure 3a, the authors show that miR223 is mitochondrially localized. Does the sequence of miR223 (WT or Mut) matter for localization? Does it matter for shuttling between YBAP1 and YBX1?

      The localization of miR223mut has not been tested in our current study. We plan to conduct these experiments in the future.

      9) Supplement 3c - Is it strange that miR190 is not localized to any particular compartment? Is miR190 present ubiquitously and equally among all intracellular compartments?

      Most mature miRNAs are predominantly localized in the cytoplasm. Although there is no specific subcellular localization reported for miR190 in the literature, our experimental findings indicate a relatively high expression of miR190 in 293T cells. It is likely that most of miR190 is localized in the cytosol. However, it is also possible that a small fraction of miR190 may associate with a membrane, which could explain its distribution in various subcellular structures. Importantly, we did not observe enrichment of miR190 in the mitochondria or exosomes.

      10) Figure 3h - Why would the miR223 levels increase if you remove mitochondria? Does CCCP also cause miR223 upregulation? I would have thought miR223 would just be mis-localized to the cytosol.

      We report that the levels of cytoplasmic miR223 increase following the removal of mitochondria by CCCP treatment. While we cannot rule out the possibility that the upregulation of miR223 is caused directly by CCCP treatment, we suggest that miR223 becomes mis-localized to the cytosol upon mitochondrial removal. Our data suggest that mitochondria contribute to the secretion of miR223 into exosomes. When mitochondria are removed by mitophagy, cytosolic miR223 is not efficiently secreted, which provides an alternative explanation for the observed increase in miR223 levels after mitochondrial removal.

      11) Figure 3i - What is the meaning of "Urd" in the figure label? This isn't mentioned anywhere.

      “Urd” stands for uridine, which is now spelled out in Figure 3i. The absence of mitochondria impairs the function of the mitochondrial enzyme dihydroorotate dehydrogenase, which is required for pyrimidine synthesis. One way to address this issue is to supplement the cell culture medium with uridine. A previous study showed that adding uridine to the culture medium improved the viability of primary fibroblasts for extended periods of time (Correia-Melo et al., 2017).

      12) Figure 3j - The data are presented as a ratio of EV/cell. Again, this inaccurately represents the amount of miR223 in the EV. This issue is apparent when looking at Figures 3h and 3j. In 3h, CCCP causes an upregulation of intracellular miR223. As such, the presumed decrease in EV miR223 after CCCP (3j) could be an artifact of increased intracellular miR223 levels. Both intracellular and EV levels of miRs need to be shown.

      Both the intracellular and exosomal levels of miR223 have been included in Figure 3j.

      13) In Figure 4, the authors show that when overexpressed, YBX1 will pulldown YBAP1. Can the authors comment as to why none of the earlier purifications show this finding (Figure 1 for example)? Even more curious is that when YBAP1 is purified, YBX1 does not co-purify (Figure 4 supplement 1a, b).

      In Figure 4a-b, human YBX1 fused with a Strep II tag was purified from 293T cells using Strep-Tactin® Sepharose® resin in a one-step purification process. Our data show that YBAP1 is expressed in 293T cells, so endogenous YBAP1 was available to co-purify with the tagged YBX1.

      In Figure 1 and Figure 4 Supplement 1a, human YBX1 or YBAP1 fused with His and MBP tags were purified from insect cells using a three-step purification process involving Ni-NTA His-Pur resin, amylose resin, and Superdex-200 gel filtration chromatography.

      One possibility is that human YBX1 or YBAP1 may not interact well with insect YBAP1 or YBX1, which would explain why tagged human YBX1 or YBAP1 was isolated from insect cells without its partner.

      Another possibility is that the expression levels of insect YBAP1 and YBX1 may be too low, so that tagged forms of YBX1 or YBAP1 expressed in insect cells may co-purify with partners that are not readily detected by Coomassie blue staining. However, in Figure 4 Supplement 1b, human YBX1 fused with His and MBP tags was co-expressed with untagged human YBAP1, and both YBX1 and YBAP1 bands were visible on the Coomassie blue-stained gel after purification using Ni-NTA His-Pur resin, amylose resin, and Superdex-200 gel filtration chromatography.

      14) Figure 4f, g - The text associated with these figures is very confusing, as is the labeling for the input. Also, what is "miR223 Fold change" in this regard? Seeing as your IgG should not have IP'd anything, normalizing to IgG can amplify noise. As such, RIP assays are typically presented as % input or fold enrichment.

      The RIP assay results have been calculated and presented as a % input in Figure 4g.
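For readers unfamiliar with the % input convention, a minimal sketch of the usual qPCR calculation follows. It assumes about 100% amplification efficiency and that the input sample represents a known fraction of the starting lysate (1% here); the Ct values are hypothetical, and the input fraction actually used for Figure 4g is not stated in this response. Fold enrichment over IgG, the alternative the reviewer mentions, is included for comparison.

```python
import math


def percent_input(ct_ip: float, ct_input: float, input_fraction: float) -> float:
    """% input for a RIP/ChIP qPCR target, assuming ~100% PCR efficiency.

    The input Ct is first adjusted to represent 100% of the starting material
    by subtracting log2 of the dilution factor (e.g. log2(100) for a 1% input).
    """
    ct_input_adjusted = ct_input - math.log2(1.0 / input_fraction)
    return 100.0 * 2.0 ** (ct_input_adjusted - ct_ip)


def fold_over_igg(ct_ip: float, ct_igg: float) -> float:
    """Enrichment of the specific IP over the IgG control (2^-dCt)."""
    return 2.0 ** (ct_igg - ct_ip)


# Hypothetical Ct values for miR223 (not data from the study).
ct_input, ct_ybap1_ip, ct_igg_ip = 24.0, 26.0, 32.0
print(f"YBAP1 IP: {percent_input(ct_ybap1_ip, ct_input, 0.01):.3f}% input")
print(f"IgG IP:   {percent_input(ct_igg_ip, ct_input, 0.01):.5f}% input")
print(f"Fold over IgG: {fold_over_igg(ct_ybap1_ip, ct_igg_ip):.0f}")
```

Because % input refers each IP to its own input rather than dividing by a small IgG signal, it avoids the noise amplification the reviewer describes when the IgG Ct is high.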

      15) Figure 4h - The authors show binding between miR223 and YBAP1; however, it is not clear how significant this binding is. The affinity of miR223 for YBX1 is more than 30-fold higher than its affinity for YBAP1. Moreover, when comparing the EMSAs and fraction bound in Figures 1 and 2 to those in Figure 4h, the binding between miR223 and YBAP1 more closely resembles that of miR190 and YBX1, which the authors state is a non-binder of YBX1. The authors will need to reconcile these discrepancies.

      We agree that YBAP1 and YBX1 differ substantially in the affinity of their interaction with miR223. It is difficult to draw conclusions from a comparison of the affinities of YBX1 for miR190 and YBAP1 for miR223. Nonetheless, a quantitative difference in the interaction of YBAP1 with miR223 and miR190 is apparent (Fig. 4h, i, j), and we observed no enrichment of miR190 in isolated mitochondria (Fig. 3 Supplement 1a), whereas YBAP1 selectively IP’d miR223 from isolated mitochondria (Fig. 4f and g).

      16) Can the authors present the Kd values for EMSA data?

      The Kd values for the EMSA data have been added to the respective figures.
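Points 6, 15, and 16 all turn on Kd values and fold differences in affinity. One minimal way to obtain a Kd from EMSA fraction-bound data is to fit a single-site binding isotherm, θ = [P]/(Kd + [P]); the fold change in affinity between two RNAs is then the ratio of their Kds. The sketch below makes assumptions that the response does not state: no cooperativity (some EMSA analyses instead fit a Hill coefficient), a probe concentration well below Kd, and a plain least-squares fit by golden-section search rather than any particular curve-fitting package. The 10 nM and 270 nM Kds are invented solely to reproduce the arithmetic of a 27-fold difference; they are not the measured values.

```python
import math


def fraction_bound(protein_nM: float, kd_nM: float) -> float:
    """Single-site isotherm; assumes the labeled RNA is well below Kd."""
    return protein_nM / (kd_nM + protein_nM)


def fit_kd(conc_nM, bound, lo_nM=1e-3, hi_nM=1e6, iters=200):
    """Least-squares Kd via golden-section search over log10(Kd)."""

    def sse(log_kd):
        kd = 10.0 ** log_kd
        return sum((f - fraction_bound(c, kd)) ** 2 for c, f in zip(conc_nM, bound))

    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = math.log10(lo_nM), math.log10(hi_nM)
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    for _ in range(iters):
        if sse(c) < sse(d):  # minimum lies in [a, d]
            b, d = d, c
            c = b - inv_phi * (b - a)
        else:  # minimum lies in [c, b]
            a, c = c, d
            d = a + inv_phi * (b - a)
    return 10.0 ** ((a + b) / 2.0)


# Synthetic fraction-bound curves (hypothetical Kds of 10 and 270 nM).
conc = [1, 3, 10, 30, 100, 300, 1000, 3000]
wt = [fraction_bound(c, 10.0) for c in conc]
mut = [fraction_bound(c, 270.0) for c in conc]

kd_wt, kd_mut = fit_kd(conc, wt), fit_kd(conc, mut)
print(f"Kd WT {kd_wt:.1f} nM, Kd mutant {kd_mut:.1f} nM, "
      f"{kd_mut / kd_wt:.0f}-fold weaker binding")
```

With real gels the fraction bound would carry noise, and a fit with confidence intervals would be needed before comparing Kds between panels.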

      17) Figure 5 - Does YBAP1-KO affect mitochondrial protein integrity or numbers?

      We generated stable cell lines expressing 3xHA-GFP-OMP25 in both 293T WT and YBAP1-KO cells, but we did not observe any alterations in mitochondrial morphology (Author response image 1).

      Author response image 1.

      Additionally, we compared the levels of several mitochondrial markers by immunoblotting in 293T WT and YBAP1-KO cells and did not observe any changes in these markers (data included in Figure 5b).

      18) Figure 6a - Are the authors using YBAP1 as their mitochondrial marker? Please include TOM20 and/or 22.

      In Figure 4c and 4e, our data clearly demonstrate that the majority of YBAP1 is localized in the mitochondria.

      To further validate this localization, we performed immunofluorescence staining using antibodies against endogenous Tom20 and YBX1. The immunofluorescence images show YBX1 associated with mitochondria (Author response image 2 and new Fig. 6a).

      Author response image 2.

      19) Figure 6b - Rab5 is an early endosome marker and may not fully represent the organelles that become MVBs. Co-localization at this point does not suggest that associating proteins will be present in the exosome, and it is possible that the authors are looking at the precursor of a recycling endosome. Even more, exosome loading does not occur at the early endosome, but instead at the MVB. Perhaps looking at markers of the late endosome such as Rab7 or ideally markers of the MVB such as M6P or CD63 would help draw an association between YBX1, YBAP1, and the exosome. Also, If the authors want to make the claim that interactions at the early endosome leads to secretion as an exosome, the authors should show that isolated EVs from Rab5Q79L-expressing cells contain miR223.

      We have previously used overexpressed Rab5(Q79L) to monitor the localization of exosomal content, specifically CD63 and YBX1, in enlarged endosomes (Liu et al., 2021, Fig. 4A, B). These endosomes exhibit a mixture of early and late endocytic markers, including CD63 (Wegner et al., 2010). Hence, Rab5(Q79L)-positive enlarged endosomes are not exclusively early endosomes.

      20) The mentioning of P-bodies is interesting but at no time is an association addressed. This is therefore an overly speculative conclusion. Either show an association or leave this out of the manuscript.

      In a previous paper we demonstrated that YBX1 puncta colocalize with P-body markers EDC4, Dcp1 and DDX6 (Liu et al., 2021).

      21) In lines 55-58, the authors make the comment "However, many of these studies used sedimentation at ~100,000 g to collect EVs, which may also collect RNP particles not enclosed within membranes which complicates the interpretation of these data." Do RNPs not dissolve when secreted? Can the authors give a reference for this statement?

      In a previous paper, we demonstrated that the RNP Ago2 does not dissolve in the conditioned medium and is not in vesicles but sediments to the bottom of a density gradient (Temoche-Diaz et al., 2019).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Strengths:

      The study was designed as a 6-month follow-up, with repeated behavioral and EEG measurements through disease development, providing valuable and interesting findings on AD progression and the effect of early-life choline supplementation. Moreover, the behavioral data that suggest an adverse effect of low choline in WT mice are interesting and important beyond the context of AD.

      Thank you for identifying several strengths.

      Weaknesses:

      (1) The multiple headings and subheadings, focusing on the experimental method rather than the narrative, reduce the readability.

      We have reduced the number of headings.

      (2) Quantification of NeuN and FosB in WT littermates is needed to demonstrate rescue of neuronal death and hyperexcitability by high choline supplementation and also to gain further insights into the adverse effect of low choline on the performance of WT mice in the behavioral test.

      We agree and have added WT data for the NeuN and ΔFosB analyses. These data are included in the text and figures. For NeuN, the Figure is Figure 6. For ΔFosB it is Figure 7. In brief, the high choline diet restored NeuN and ΔFosB to the levels of WT mice.

      Below is Figure 6 and its legend to show the revised presentation of data for NeuN. Afterwards is the revised figure showing data for ΔFosB. After that are the sections of the Results that have been revised.

      Author response image 1.

      Choline supplementation improved NeuN immunoreactivity (ir) in hilar cells in Tg2576 animals. A. Representative images of NeuN-ir staining in the anterior DG of Tg2576 animals. (1) A section from a Tg2576 mouse fed the low choline diet. The area surrounded by a box is expanded below. Red arrows point to NeuN-ir hilar cells. Mol=molecular layer, GCL=granule cell layer, HIL=hilus. Calibration for the top row, 100 µm; for the bottom row, 50 µm. (2) A section from a Tg2576 mouse fed the intermediate diet. Same calibrations as for 1. (3) A section from a Tg2576 mouse fed the high choline diet. Same calibrations as for 1. B. Quantification methods. Representative images demonstrate the thresholding criteria used to quantify NeuN-ir. (1) A NeuN-stained section. The area surrounded by the white box is expanded in the inset (arrow) to show 3 hilar cells. The 2 NeuN-ir cells above threshold are marked by blue arrows. The 1 NeuN-ir cell below threshold is marked by a green arrow. (2) After converting the image to grayscale, the cells above threshold were designated as red. The inset shows that the two cells marked by blue arrows are red while the cell below threshold is not. (3) An example of the threshold menu from ImageJ showing how the threshold was set. Sliders (red circles) were used to move the threshold to the left or right of the histogram of intensity values. The final position of the slider (red arrow) was at the onset of the steep rise of the histogram. C. NeuN-ir in Tg2576 and WT mice. Tg2576 mice were fed the low, intermediate, or high choline diet in early life. WT mice were fed the standard diet (intermediate choline). (1) Tg2576 mice treated with the high choline diet had significantly more hilar NeuN-ir cells in the anterior DG than Tg2576 mice fed the low choline or intermediate diet. The values for Tg2576 mice that received the high choline diet were not significantly different from WT mice, suggesting that the high choline diet restored NeuN-ir. (2) There was no effect of diet or genotype in the posterior DG, probably because the low choline and intermediate diets did not appear to lower hilar NeuN-ir.

      Author response image 2.

      Choline supplementation reduced ΔFosB expression in dorsal GCs of Tg2576 mice. A. Representative images of ΔFosB staining in the GCL of Tg2576 animals from each treatment group. (1) A section from a low choline-treated mouse shows robust ΔFosB-ir in the GCL. Calibration, 100 µm. Sections from intermediate (2) and high choline (3)-treated mice. Same calibration as 1. B. Quantification methods. Representative images demonstrating the thresholding criteria established to quantify ΔFosB. (1) A ΔFosB-stained section shows strongly stained cells (white arrows). (2) A strict thresholding criterion was used to make only the darkest stained cells red. C. Use of the strict threshold to quantify ΔFosB-ir. (1) Anterior DG. Tg2576 mice treated with the choline-supplemented diet had significantly less ΔFosB-ir than Tg2576 mice fed the low or intermediate diets. Tg2576 mice fed the high choline diet were not significantly different from WT mice, suggesting a rescue of ΔFosB-ir. (2) There were no significant differences in ΔFosB-ir in posterior sections. D. Methods are shown using a less strict threshold. (1) Some of the stained cells that were included are not as dark as those used for the strict threshold (white arrows). (2) All cells above the less strict threshold are shown in red. E. Use of the less strict threshold to quantify ΔFosB-ir. (1) Anterior DG. Tg2576 mice that were fed the high choline diet had fewer ΔFosB-ir pixels than the mice that were fed the other diets. There were no differences from WT mice, suggesting restoration of ΔFosB-ir by choline enrichment in early life. (2) Posterior DG. There were no significant differences between Tg2576 mice fed the 3 diets or WT mice.

      Results, Section C1, starting on Line 691:

      “To ask whether the improvement in NeuN after MCS in Tg2576 mice restored NeuN to WT levels, we included WT mice. For this analysis we used a one-way ANOVA with 4 groups: Low choline Tg2576, Intermediate Tg2576, High choline Tg2576, and Intermediate WT (Figure 5C). Tukey-Kramer multiple comparisons tests were used as the post hoc tests. The WT mice were fed the intermediate diet because it is the standard mouse chow, and this group was intended to reflect normal mice. The results showed a significant group difference for the anterior DG (F(3,25)=9.20; p=0.0003; Figure 5C1) but not the posterior DG (F(3,28)=0.867; p=0.450; Figure 5C2). In the anterior DG, there were more NeuN-ir cells in high choline-treated mice than in both low choline (p=0.046) and intermediate choline-treated Tg2576 mice (p=0.003). WT mice had more NeuN-ir cells than Tg2576 mice fed the low (p=0.011) or intermediate diet (p=0.003). Tg2576 mice that were fed the high choline diet were not significantly different from WT (p=0.827).”

      Results, Section C2, starting on Line 722:

      “There was strong expression of ΔFosB in Tg2576 GCs in mice fed the low choline diet (Figure 7A1). Mice fed the intermediate and high choline diets appeared to show less GCL ΔFosB-ir (Figure 7A2-3). A two-way ANOVA was conducted with experimental group (Tg2576 low choline diet, Tg2576 intermediate choline diet, Tg2576 high choline diet, WT intermediate choline diet) and location (anterior or posterior) as main factors. There was a significant effect of group (F(3,32)=13.80, p<0.0001) and location (F(1,32)=8.69, p=0.006). Tukey-Kramer post hoc tests showed that Tg2576 mice fed the low choline diet had significantly greater ΔFosB-ir than Tg2576 mice fed the high choline diet (p=0.0005) and WT mice (p=0.0007). Tg2576 mice fed the low and intermediate diets were not significantly different (p=0.275). Tg2576 mice fed the high choline diet were not significantly different from WT (p>0.999). There were no differences between groups for the posterior DG (all p>0.05).”

      “ΔFosB quantification was repeated with a lower threshold to define ΔFosB-ir GCs (see Methods), and the results were the same (Figure 7D). Two-way ANOVA showed a significant effect of group (F(3,32)=14.28, p<0.0001) and location (F(1,32)=7.07, p=0.0122) for the anterior DG but not the posterior DG (Figure 7D). For anterior sections, Tukey-Kramer post hoc tests showed that low choline mice had greater ΔFosB-ir than high choline mice (p=0.0024) and WT mice (p=0.005) but not Tg2576 mice fed the intermediate diet (p=0.275; Figure 7D1). Mice fed the high choline diet were not significantly different from WT (p=0.993; Figure 7D1). These data suggest that high choline in the diet early in life can reduce neuronal activity of GCs in offspring later in life. In addition, low choline has an opposite effect, suggesting that low choline in early life has adverse effects.”

      (3) Quantification of the discrimination ratio of the novel object and novel location tests can facilitate the comparison between the different genotypes and diets.

      We have added the discrimination index for novel object location to the paper. The data are in a new figure: Figure 3. In brief, the discrimination index led to the same conclusions as the original analysis, which was based on the percentage of time spent exploring the novel object.

      Below is the new Figure and legend, followed by the new text in the Results.

      Author response image 3.

      Novel object location results based on the discrimination index. A. Results are shown for the 3-month-old WT and Tg2576 mice based on the discrimination index. (1) Mice fed the low choline diet showed object location memory only in WT. (2) Mice fed the intermediate diet showed object location memory only in WT. (3) Mice fed the high choline diet showed memory in both WT and Tg2576 mice. Therefore, the high choline diet improved memory in Tg2576 mice. B. The results for the 6-month-old mice are shown. (1-2) Mice fed either the low or intermediate choline diet showed no significant memory. (3) Mice fed a diet enriched in choline showed memory whether they were WT or Tg2576 mice. Therefore, choline enrichment improved memory in all mice.

      Results, Section B1, starting on line 536:

      “The discrimination indices are shown in Figure 3, and the results led to the same conclusions as the analyses in Figure 2. For the 3-month-old mice (Figure 3A), neither WT nor Tg2576 mice in the low choline group showed the ability to perform the task. A two-way ANOVA showed no effect of genotype (F(1,74)=0.027, p=0.870) or task phase (F(1,74)=1.41, p=0.239). For the intermediate diet-treated mice, there was no effect of genotype (F(1,50)=3.52, p=0.067) but there was an effect of task phase (F(1,50)=8.33, p=0.006). WT mice showed a greater discrimination index during testing relative to training (p=0.019) but Tg2576 mice did not (p=0.664). Therefore, Tg2576 mice fed the intermediate diet were impaired. In contrast, high choline-treated mice performed well. There was a main effect of task phase (F(1,68)=39.61, p<0.001), with WT (p<0.0001) and Tg2576 mice (p=0.0002) showing preference for the moved object in the test phase. Interestingly, there was a main effect of genotype (F(1,68)=4.50, p=0.038) because the discrimination index for WT training was significantly different from Tg2576 testing (p<0.0001) and Tg2576 training was significantly different from WT testing (p=0.0003).”

      “The discrimination indices of 6-month-old mice led to the same conclusions as the results in Figure 2. There was no evidence of discrimination in low choline-treated mice by two-way ANOVA (no effect of genotype, F(1,42)=3.25, p=0.079; no effect of task phase, F(1,42)=0.278, p=0.601). The same was true of mice fed the intermediate diet (genotype, F(1,12)=1.44, p=0.253; task phase, F(1,12)=2.64, p=0.130). However, both WT and Tg2576 mice performed well after being fed the high choline diet (effect of task phase, F(1,52)=58.75, p=0.0001; no effect of genotype, F(1,52)=1.197, p=0.279). Tukey-Kramer post hoc tests showed that both WT (p<0.0001) and Tg2576 mice that had received the high choline diet (p=0.0005) had elevated discrimination indices for the test session.”
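The response does not give the formula for the discrimination index. A frequently used definition is (t_novel − t_familiar)/(t_novel + t_familiar). Assuming that definition, and that percent time is computed from the same total exploration time, DI = 2 × (percent time/100) − 1. Because this is a positive linear rescaling, ANOVA F statistics and p-values computed on DI equal those computed on percent time, which would explain why the two analyses led to the same conclusions. The sketch below checks that relationship with hypothetical exploration times; the function names are illustrative.

```python
def percent_time_novel(t_novel: float, t_familiar: float) -> float:
    """Percentage of total object exploration spent on the novel/moved object."""
    return 100.0 * t_novel / (t_novel + t_familiar)


def discrimination_index(t_novel: float, t_familiar: float) -> float:
    """Common discrimination index: (novel - familiar) / (novel + familiar)."""
    return (t_novel - t_familiar) / (t_novel + t_familiar)


# Hypothetical exploration times in seconds (not data from the study).
for novel, familiar in [(30.0, 20.0), (12.0, 18.0), (25.0, 25.0)]:
    pct = percent_time_novel(novel, familiar)
    di = discrimination_index(novel, familiar)
    # DI is an affine function of percent time: DI = 2 * pct / 100 - 1.
    assert abs(di - (2.0 * pct / 100.0 - 1.0)) < 1e-12
    print(f"{pct:5.1f}% novel -> DI {di:+.2f}")
```

A DI of 0 corresponds to 50% (no preference), so a positive DI in the test phase plays the same role as a preference above 50%.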

      (4) The longitudinal analyses enable multi-level correlations between the discrimination ratio in NOR and NOL, NeuN and Fos levels, multiple EEG parameters, and premature death. Such analyses could identify biomarkers associated with AD progression. These correlations could be informative for the different choline supplementation conditions, but also for the standard choline diet.

      We agree and added correlations to the paper in a new figure (Figure 9). Below is Figure 9 and its legend. Afterwards is the new Results section.

      Author response image 4.

      Correlations between IIS, Behavior, and hilar NeuN-ir. A. IIS frequency over 24 hrs is plotted against the preference for the novel object in the test phase of NOL. A greater preference is reflected by a greater percentage of time exploring the novel object. (1) The mice fed the high choline diet (red) showed greater preference for the novel object when IIS were low. These data suggest IIS impaired object location memory in the high choline-treated mice. The low choline-treated mice had very weak preference and very few IIS, potentially explaining the lack of correlation in these mice. (2) There were no significant correlations for IIS and NOR. However, there were only 4 mice for the high choline group, which is a limitation. B. IIS frequency over 24 hrs is plotted against the number of dorsal hilar cells expressing NeuN. The dorsal hilus was used because there was no effect of diet on the posterior hilus. (1) Hilar NeuN-ir is plotted against the preference for the novel object in the test phase of NOL. There were no significant correlations. (2) Hilar NeuN-ir was greater for mice that had better performance in NOR, both for the low choline (blue) and high choline (red) groups. These data support the idea that hilar cells contribute to object recognition (Kesner et al. 2015; Botterill et al. 2021; GoodSmith et al. 2022).

      Results, Section F, starting on Line 801:

      “F. Correlations between IIS and other measurements

      As shown in Figure 9A, IIS were correlated with behavioral performance in some conditions. For these correlations, only mice that were fed the low and high choline diets were included, because mice that were fed the intermediate diet did not have sufficient EEG recordings in the same mice in which behavior was studied. IIS frequency over 24 hrs was plotted against the preference for the novel object in the test phase (Figure 9A). For NOL, IIS were significantly less frequent when behavior was best, but only for the high choline-treated mice (Pearson’s r, p=0.022). In the low choline group, behavioral performance was poor regardless of IIS frequency (Pearson’s r, p=0.933; Figure 9A1). For NOR, there were no significant correlations (low choline, p=0.202; high choline, p=0.680), but few high choline-treated mice were tested (Figure 9B2).

      We also tested whether there were correlations between dorsal hilar NeuN-ir cell numbers and IIS frequency. In Figure 9B, IIS frequency over 24 hrs was plotted against the number of dorsal hilar cells expressing NeuN. The dorsal hilus was used because there was no effect of diet on the posterior hilus. For NOL, there was no significant correlation (low choline, p=0.273; high choline, p=0.159; Figure 9B1). However, for NOR, there were more NeuN-ir hilar cells when the behavioral performance was strongest (low choline, p=0.024; high choline, p=0.016; Figure 9B2). These data support prior studies showing that hilar cells, especially mossy cells (the majority of hilar neurons), contribute to object recognition (Botterill et al. 2021; GoodSmith et al. 2022).”
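The correlations above are reported as Pearson’s r with p-values. The sketch below computes r and the t statistic used to test r ≠ 0 (n − 2 degrees of freedom) in plain Python; in practice `scipy.stats.pearsonr` returns both r and the two-sided p-value. The data are hypothetical. With only 4 mice (2 degrees of freedom), |r| must exceed about 0.95 to reach two-tailed p < 0.05, which is why the small high choline NOR group limits that comparison.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient for paired observations."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need at least 3 paired observations")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_for_r(r, n):
    """t statistic with n - 2 degrees of freedom for testing r != 0 (|r| < 1)."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Hypothetical IIS/24 h and % time with the moved object (not study data).
iis_per_day = [2, 5, 9, 14, 20]
pct_novel = [72, 66, 60, 55, 51]
r = pearson_r(iis_per_day, pct_novel)
print(f"r = {r:.3f}, t({len(iis_per_day) - 2}) = {t_for_r(r, len(iis_per_day)):.2f}")
```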

      We also noted that not all mice could be included in every assay, because some died or for other reasons, such as loss of the headset (Results, Section A, Lines 463-464): “Some mice could not be included in all assays, either because they died before reaching 6 months or for other reasons.”

      Reviewer #2 (Public Review):

      Strengths:

      The strength of the group was the ability to monitor the incidence of interictal spikes (IIS) over the course of 1.2-6 months in the Tg2576 Alzheimer's disease model, combined with meaningful behavioral and histological measures. The authors were able to demonstrate MCS had protective effects in Tg2576 mice, which was particularly convincing in the hippocampal novel object location task.

      We thank the Reviewer for identifying several strengths.

      Weaknesses:

      Although choline deficiency was associated with impaired learning and elevated FosB expression, consistent with increased hyperexcitability, IIS was reduced with both low and high choline diets. Although not necessarily a weakness, it complicates the interpretation and requires further evaluation.

      We agree and we revised the paper to address the evaluations that were suggested.

      Reviewer #1 (Recommendations For The Authors):

      (1) A reference directing to genotyping of Tg2576 mice is missing.

      We apologize for the oversight and added that the mice were genotyped by the New York University Mouse Genotyping core facility.

      Methods, Section A, Lines 210-211: “Genotypes were determined by the New York University Mouse Genotyping Core facility using a protocol to detect APP695.”

      (2) Which software was used to track the mice in the behavioral tests?

      We manually reviewed the videos. This has been clarified in the revised manuscript (Methods, Section B4, Lines 268-270): “Videos of the training and testing sessions were analyzed manually. A subset of data was analyzed by two independent blinded investigators and they were in agreement.”

      (3) Unexpectedly, a low choline diet in AD mice was associated with reduced frequency of interictal spikes yet increased mortality and spontaneous seizures. The authors attribute this to postictal suppression.

      We did not intend to suggest that postictal depression was the only cause. It was one of many potential explanations for why seizures would influence IIS frequency. Specifically, we suggested that postictal depression could transiently reduce IIS. We have clarified the text accordingly (Discussion, starting on Line 960):

      If mice were unhealthy, IIS might have been reduced due to impaired excitatory synaptic function. Another reason for reduced IIS is that the mice fed the low choline diet had seizures, which interrupted REM sleep. Seizures in Tg2576 mice typically started in sleep, and less REM sleep would reduce IIS because IIS occur primarily in REM. Also, seizures in the Tg2576 mice were followed by a depression of the EEG (postictal depression; Supplemental Figure 3) that would transiently reduce IIS. A different, radical explanation is that the intermediate diet promoted IIS, rather than low choline reducing IIS. In that case, a constituent of the intermediate diet other than choline may have promoted IIS.

      However, reduced spike frequency is already evident at 5 weeks of age, a time point with a low occurrence of premature death. A more comprehensive analysis of EEG background activity may provide additional information if the epileptic activity is indeed reduced at this age.

      We did not intend to suggest that premature death caused reduced spike frequency. We have clarified the paper accordingly. We agree that a more in-depth EEG analysis would be useful but is beyond the scope of the study.

      (4) Supplementary Fig. 3 depicts far more spikes / 24 h compared to Fig. 7B (at least 100 spikes/24h in Supplementary Fig. 3 and less than 10 spikes/24h in Fig. 7B).

      We would like to clarify that spike frequency is unusually high before and after a seizure. Therefore, there are far more spikes in Supplementary Fig. 3 than in the other figures.

      We clarified this issue by adding more data to the Supplemental Figure. The additional data are from mice without a seizure and show that their spike frequency is low.

      All recordings lasted several days. We included the data from mice with a seizure on one of the days and mice without any seizures. For mice with a seizure, we graphed IIS frequency for the day before, the day of the seizure, and the day after. For mice without a seizure, IIS frequency is plotted for 3 consecutive days. When there was a seizure, the day before and after showed high numbers of spikes. When there was no seizure on any of the 3 days, spikes were infrequent on all days.
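The per-day IIS counts described above can be sketched as binning spike timestamps into consecutive 24-h windows. The response does not say how day boundaries were defined (for example, calendar days versus 24-h blocks from a chosen start time); this sketch assumes 24-h blocks starting at a chosen time, and both the function and the timestamps are hypothetical. Spike detection itself is not shown.

```python
from collections import Counter


def daily_iis_counts(spike_times_h, day1_start_h, n_days=3):
    """Count IIS in consecutive 24-h bins starting at day1_start_h.

    spike_times_h: spike timestamps in hours from the start of the recording.
    day1_start_h: start of day 1 (for seizure mice, the day before the seizure).
    """
    counts = Counter()
    for t in spike_times_h:
        day = int((t - day1_start_h) // 24)  # 0-based bin index
        if 0 <= day < n_days:
            counts[day] += 1
    return [counts[d] for d in range(n_days)]


# Hypothetical spike times (hours); a seizure is assumed on day 2 (hours 24-48).
spikes = [3.5, 10.0, 25.2, 25.3, 26.0, 30.1, 47.9, 50.0, 70.5, 80.0]
print(daily_iis_counts(spikes, day1_start_h=0.0))
```

For mice without a seizure, the same function applied to any 3 consecutive days gives the comparison group.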

      The revised figure and legend are shown below. It is Supplemental Figure 4 in the revised submission.

      Author response image 5.

      IIS frequency before and after seizures. A. Representative EEG traces recorded from electrodes implanted in the skull over the left frontal cortex, right occipital cortex, left hippocampus (Hippo), and right hippocampus during a spontaneous seizure in a 5-month-old Tg2576 mouse. Arrows point to the start (green arrow) and end of the seizure (red arrow), and to postictal depression (blue arrow). B. IIS frequency was quantified from continuous video-EEG for mice that had a spontaneous seizure during the recording period and mice that did not. IIS frequency is plotted for 3 consecutive days, starting with the day before the seizure (designated as day 1) and ending with the day after the seizure (day 3). A two-way RMANOVA was conducted with day and group (mice with or without a seizure) as main factors. There was a significant effect of day (F(2,4)=46.95, p=0.002) and group (seizure vs. no seizure; F(1,2)=46.01, p=0.021) and an interaction of factors (F(2,4)=46.68, p=0.002). Tukey-Kramer post hoc tests showed that mice with a seizure had significantly greater IIS frequencies than mice without a seizure on every day (day 1, p=0.0005; day 2, p=0.0001; day 3, p=0.0014). For mice with a seizure, IIS frequency was higher on the day of the seizure than the day before (p=0.037) or after (p=0.010). For mice without a seizure, there were no significant differences in IIS frequency across days 1, 2, and 3. These data are similar to prior work showing that mice without seizures have similar IIS frequencies from one day to the next (Kam et al., 2016).

      In the text, the revised section is in the Results, Section C, starting on Line 772:

      “At 5-6 months, IIS frequencies were not significantly different in the mice fed the different diets (all p>0.05), probably because IIS frequency becomes increasingly variable with age (Kam et al. 2016). One source of variability is seizures, because there was a sharp increase in IIS during the day before and after a seizure (Supplemental Figure 4). Another reason that the diets failed to show differences was that the IIS frequency generally declined at 5-6 months. This can be appreciated in Figure 8B and Supplemental Figure 6B. These data are consistent with prior studies of Tg2576 mice where IIS increased from 1 to 3 months but then waxed and waned afterwards (Kam et al., 2016).”

      (5) The data indicating the protective effect of high choline supplementation are valuable, yet some of the claims are not completely supported by the data, mainly as the analysis of littermate WT mice is not complete.

      We added WT data to show that the high choline diet restored cell loss and ΔFosB expression to WT levels. These data strengthen the argument that the high choline diet was valuable. See the response to Reviewer #1, Public Review Point #2.

      • Line 591: "The results suggest that choline enrichment protected hilar neurons from NeuN loss in Tg2576 mice." A comparison to NeuN expression in WT mice is needed to make this statement.

      These data have been added. See the response to Reviewer #1, Public Review Point #2.

      • Line 623: "These data suggest that high choline in the diet early in life can reduce hyperexcitability of GCs in offspring later in life. In addition, low choline has an opposite effect, again suggesting this maternal diet has adverse effects." Also here, FosB quantification in WT mice is needed.

      These data have been added. See the response to Reviewer #1, Public Review Point #2.

      (7) Was the effect of choline associated with reduced tauopathy or Aβ levels?

      The mice have no detectable hyperphosphorylated tau. The mice do have intracellular Aβ before 6 months. This is especially the case in hilar neurons, but GCs have little (Criscuolo et al., eNeuro, 2023). However, in neurons that have reduced NeuN, we found previously that antibodies generally do not work well. We think it is because the neurons become pyknotic (Duffy et al., 2015), a condition associated with oxidative stress, which causes antigens like NeuN to change conformation due to phosphorylation. Therefore, we did not conduct a comparison of hilar neurons across the different diets.

      (8) Since the mice were tested at 3 months and 6 months, it would be interesting to see the behavioral difference per mouse and the correlation with EEG recording and immunohistological analyses.

      We agree that would be valuable and this has been added to the paper. Please see response to Reviewer #1, Public Review Point #4.

      Reviewer #2 (Recommendations For The Authors):

      There were several areas that could be further improved, particularly in the areas of data analysis (particularly with images and supplemental figures), figure presentation, and mechanistic speculation.

      Major points:

      (1) It is understandable that, for the sake of labor and expense, WT mice were not implanted with EEG electrodes, particularly since previous work showed that WT mice have no IIS (Kam et al. 2016). However, from the standpoint of a full factorial experimental design, there are several flaws, which purists would argue are fatal. First, the lack of WT groups creates underpowered and imbalanced groups, constraining statistical comparisons and likely reducing the significance of the results. Also, it is an assumption that diet does not influence IIS in WT mice. Second, with a within-subject experimental design (as described in Fig. 1A), 6-month-old mice are not naïve if they have previously been tested at 3 months. Such an experimental design may reduce effect size compared with testing naïve mice. These caveats should be included in the Discussion. It is likely that these caveats reduce effect size and that the actual statistical significance, were the experimental design perfect, would be higher overall.

      We agree and have added these points to the Limitations section of the Discussion. Starting on Line 1050: In addition, groups were not exactly matched. Although WT mice do not have IIS, a WT group for each of the Tg2576 groups would have been useful. Instead, we included WT mice for the behavioral tasks and some of the anatomical assays. Related to this point is that several mice died during the long-term EEG monitoring of IIS.

      (2) Since behavior, EEG, NeuN and FosB experiments seem to be done on every Tg2576 animal, it seems that there are missed opportunities to correlate behavior/EEG and histology on a per-mouse basis. For example, rather than speculate in the discussion, why not (for example) directly examine relationships between IIS/24 hours and FosB expression?

      We addressed this point above in responding to Reviewer #1, Public Review Point #4.

      (3) Methods of image quantification should be improved. Background subtraction should be considered in the analysis workflow (see Fig. 5C and Fig. 6C background). It would be helpful to have a Methods figure illustrating intermediate processing steps for both NeuN and FosB expression.

      We added more information to improve the methods of quantification. We did use a background subtraction approach: ImageJ provides a histogram of intensity values, which shows where there is a sharp rise in staining relative to background. That point is where we set the threshold. We think this procedure has the least subjectivity.

      We added these methods to the Methods section and expanded the first figure about image quantification, Figure 6B. That figure and legend are shown above in response to Reviewer #1, Point #2.

      This is the revised section of the Methods, Section C3, starting on Line 345:

      “Photomicrographs were acquired using ImagePro Plus V7.0 (Media Cybernetics) and a digital camera (Model RET 2000R-F-CLR-12, Q-Imaging). NeuN and ∆FosB staining were quantified from micrographs using ImageJ (V1.44, National Institutes of Health). All images were first converted to grayscale, and in each section the hilus was traced, defined as zone 4 of Amaral (1978). A threshold was then calculated to identify the NeuN-stained cell bodies but not background, and NeuN-stained cell bodies in the hilus were quantified manually. Note that the threshold was defined in ImageJ using the distribution of intensities in the micrograph. The threshold was set using a slider in the histogram provided by ImageJ. The slider was pushed from the low level of staining (similar to background) to the location where staining intensity made a sharp rise, reflecting stained cells. Cells with labeling above threshold were counted.”
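      The manual slider criterion above (moving the threshold from the background peak to where intensity rises sharply) resembles the triangle method, one of ImageJ's built-in automatic threshold methods. The sketch below is offered only to illustrate how that criterion could be made reproducible; it is not the procedure used in the study. It assumes a histogram in which background forms the dominant peak at low intensities and stained cell bodies form a tail toward higher intensities (for example, after inverting a brightfield image), and the function name `triangle_threshold` is hypothetical.

```python
def triangle_threshold(hist):
    """Return a threshold bin for a histogram with a dominant background peak.

    Assumes background occupies the low-intensity peak and stained cells form
    a tail toward higher intensities. A line is drawn from the peak to the last
    non-empty bin; the threshold is the bin whose count lies farthest below it.
    """
    peak = max(range(len(hist)), key=lambda i: hist[i])
    end = max(i for i, count in enumerate(hist) if count > 0)
    if end == peak:
        return peak  # no tail above the peak, so nothing to separate

    x1, y1, x2, y2 = peak, hist[peak], end, hist[end]
    best_bin, best_gap = peak, float("-inf")
    for i in range(peak, end + 1):
        # Proportional to the vertical gap between the line and the histogram;
        # positive when the histogram lies below the line.
        gap = (y2 - y1) * i - (x2 - x1) * hist[i] + x2 * y1 - y2 * x1
        if gap > best_gap:
            best_bin, best_gap = i, gap
    return best_bin
```

      For the toy histogram `[0, 100, 50, 10, 5, 5, 5, 4, 0]`, the function returns bin 3, the "knee" where the background peak gives way to the flatter tail; pixels above that bin would be treated as stained.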

      (4) This reviewer is surprised that the authors do not speculate more about ACh-related mechanisms. For example, choline deficiency would likely reduce ACh release, which could have the same effect on IIS as muscarinic antagonism (Kam et al. 2016), and could potentially explain the paradoxical effects of a low choline diet on reducing IIS. Some additional mechanistic speculation would be helpful in the Discussion.

      We thank the Reviewer for noting this so we could add it to the Discussion. We had not done so because we were concerned about space limitations.

      The Discussion has a new section starting on Line 1009:

      “Choline and cholinergic neurons

      There are many suggestions for the mechanisms that allow MCS to improve health of the offspring. One hypothesis that we are interested in is that MCS improves outcomes by reducing IIS. Reducing IIS would potentially reduce hyperactivity, which is significant because hyperactivity can increase release of Aβ. IIS would also be likely to disrupt sleep since it represents aberrant synchronous activity over widespread brain regions. The disruption to sleep could impair memory consolidation, since memory consolidation is a notable function of sleep (Graves et al. 2001; Poe et al. 2010). Sleep disruption also has other negative consequences such as impairing normal clearance of Aβ (Nedergaard and Goldman 2020). In patients, IIS and similar events, IEDs, are correlated with memory impairment (Vossel et al. 2016).

      How would choline supplementation in early life reduce IIS of the offspring? It may do so by making basal forebrain cholinergic neurons (BFCNs) more resilient. That is significant because BFCN abnormalities appear to cause IIS. For example, the muscarinic antagonist atropine reduced IIS in vivo in Tg2576 mice, and selective silencing of BFCNs also reduced IIS. Atropine also reduced elevated synaptic activity of GCs in young Tg2576 mice in vitro. These studies are consistent with the idea that early in AD there is elevated cholinergic activity (DeKosky et al. 2002; Ikonomovic et al. 2003; Kelley et al. 2014; Mufson et al. 2015; Kelley et al. 2016), while later in life there is degeneration. Indeed, the chronic overactivity could cause the degeneration.

      Why would MCS make BFCNs resilient? There are several possibilities that have been explored, based on genes upregulated by MCS. One attractive hypothesis is that neurotrophic support for BFCNs is retained after MCS but in aging and AD it declines (Gautier et al. 2023). The neurotrophins, notably nerve growth factor (NGF) and brain-derived neurotrophic factor (BDNF) support the health of BFCNs (Mufson et al. 2003; Niewiadomska et al. 2011).”

      Minor points:

      (1) The vendor is Dyets Inc., not Dyets.

      Thank you. This correction has been made.

      (2) Anesthesia chamber not specified (make, model, company).

      We have added this information to the Methods, Section D1, starting on Line 375: The animals were anesthetized by isoflurane inhalation (3% isoflurane, 2% oxygen for induction) in a rectangular transparent Plexiglas chamber (18 cm long x 10 cm wide x 8 cm high) made in-house.

      (3) It is not clear whether software was used for the detection of behavior. Was position tracking software used or did blind observers individually score metrics?

      We have added the information to the paper. Please see the response to Reviewer #1, Recommendations for Authors, Point #2.

      (4) It is not clear why rat cages and not a true Open Field Maze were used for NOL and NOR.

      We used mouse cages because, in our experience, they are ideal for detecting impairments in Tg2576 mice at young ages. We think this is why we have been so successful in identifying NOL impairments in young mice; before our work, most investigators thought behavior only became impaired later. We would also add that, in our experience, an Open Field Maze is not the most commonly used arena for these tasks.

      (5) Figure 1A is not mentioned.

      It had been mentioned in the Introduction. Figure 1B-D was the first figure mentioned in the Results, which may be why the mention of Figure 1A was missed. We have now added it to the first section of the Results, Line 457, so it is easier to find.

      (6) Although Fig. 7 results are somewhat complicated compared to Fig. 5 and 6 results, EEG comes chronologically earlier than NeuN and FosB expression experiments.

      We have kept the order as is because as the Reviewer said, the EEG is complex. For readability, we have kept the EEG results last.

      (7) Though the statistical analysis involved parametric and nonparametric tests, it is not clear which normality tests were used.

      We have added the names of the normality tests to the Methods, Section E, Line 443: Tests for normality (Shapiro-Wilk) and homogeneity of variance (Bartlett’s test) were used to determine if parametric statistics could be used. We also added the following clarification after this sentence: When data were not normal, non-parametric statistics were used. When there was significant heteroscedasticity, data were log transformed. If log transformation did not resolve the heteroscedasticity, non-parametric statistics were used. Because we added correlations and analysis of survival curves, we also added the following (starting on Line 451): For correlations, Pearson’s r was calculated. To compare survival curves, a log-rank (Mantel-Cox) test was performed.
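      The decision rules quoted above can be written as a small function, which makes the order of the checks explicit. This is a minimal sketch, not the authors' code: it assumes α = 0.05 (the response does not state the threshold), takes p-values already computed by whatever software was used (for example `scipy.stats.shapiro` and `scipy.stats.bartlett` in Python), does not re-check normality after the log transform because the text does not say whether that was done, and uses a hypothetical function name and return labels.

```python
def choose_analysis(normality_pvalues, variance_p, log_variance_p=None, alpha=0.05):
    """Pick an analysis route following the decision rules described above.

    normality_pvalues: Shapiro-Wilk p-values, one per group.
    variance_p: Bartlett's test p-value on the raw data.
    log_variance_p: Bartlett's test p-value after log transformation, if computed.
    """
    if any(p < alpha for p in normality_pvalues):
        return "nonparametric"  # at least one group is not normal
    if variance_p >= alpha:
        return "parametric"  # normal and homoscedastic
    if log_variance_p is not None and log_variance_p >= alpha:
        return "parametric_on_log"  # log transform resolved heteroscedasticity
    return "nonparametric"  # heteroscedasticity persisted after log transform
```

      Keeping the rule as one function means every dataset in a paper is routed the same way, which is easier to report and to audit than ad hoc choices per figure.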

      Figures:

      (1) In Fig. 1A, Anatomy should be placed above the line.

      We changed the figure so that the word “Anatomy” is now aligned, and the arrow that was angled is no longer needed.

      In Fig. 1C and 1D, the objects seem to be moved into the cage, not the mice. This schematic does not accurately reflect the Fig. 1C and 1D figure legend text.

      Thank you for the excellent point. The figure has been revised. We also updated it to show the objects more accurately.

      Please correct the punctuation in the Fig. 1D legend.

      Thank you for mentioning the errors. We corrected the legend.

      For ease of understanding, Fig. 1C and 1D should have training and testing labeled in the figure.

      Thank you for the suggestion. We have revised the figure as suggested.

      Author response image 6.

      (2) In Figure 2, error bars for population stats (bar graphs) are not obvious or missing. Same for Figure 3.

      We added two supplemental figures to show error bars, because adding the error bars to the existing figures made the symbols, colors, connecting lines and error bars hard to distinguish. For novel object location (Fig. 2) the error bars are shown in Supp. Fig. 2. For novel object recognition, the error bars are shown in Supplemental Fig. 3.

      (3) The authors should consider a Methods figure for quantification of NeuN and deltaFOSB (expansions of Fig. 5C and Fig. 6C).

      Please see Reviewer #1, Public Review Point #2.

      (4) In Figure 5, A should be omitted and mentioned in the Methods/figure legend. B should be enlarged. C should be inset, zoomed-in images of the hilus, with an accompanying analysis image showing a clear reduction in NeuN intensity in low choline conditions compared to intermediate and high choline conditions. In D, X axes could delineate conditions (figure legend and color unnecessary). Figure 5C should be moved to a Methods figure.

      We thank the reviewer for the excellent suggestions. We removed A as suggested. We expanded B and included insets. We used different images to show a more obvious reduction of cells for the low choline group. We expanded the Methods schematics. The revised figure is Figure 6 and shown above in response to Reviewer 1, Public Review Point #2.

      (5) In Figure 6, A should be eliminated and mentioned in the Methods/figure legend. B should be greatly expanded with higher and lower thresholds shown on subsequent panels (3x3 design).

      We removed A as suggested. We expanded B as suggested. The higher and lower thresholds are shown in C. The revised figure is Figure 7 and shown above in response to Reviewer 1, Public Review Point #2.

      (6) In Figure 7, A2 should be expanded vertically. A3 should be expanded both vertically and horizontally. B1 and B2 should be enlarged, particularly B1, where it is difficult to see symbols. Perhaps colored symbols could be offset or staggered per group so that the spread per group is clearer.

      We added a panel (A4) to show an expansion of A2 and A3. However, we did not see that a vertical expansion would add information so we opted not to add that. We expanded B1 as suggested but opted not to expand B2 because we did not think it would enhance clarity. The revised figure is below.

      Author response image 7.

      (7) Supplemental Figure 1 could possibly be combined with Figure 1 (use rounded corner rat cage schematic for continuity).

      We opted not to combine figures because it would make one extremely large figure. As a result, the parts of the figure would be small and difficult to see.

      (8) Supplemental Figure 2 - there does not seem to be any statistical analysis associated with A mentioned in the Results text.

      We added the statistical information. It is now Supplemental Figure 5:

      Author response image 8.

      Mortality was high in mice treated with the low choline diet. A. Survival curves are shown for mice fed the low choline diet and mice fed the high choline diet. The mice fed the high choline diet had significantly better survival. B. Left: A photo of a mouse after sudden unexplained death. The mouse was found in a posture consistent with death during a convulsive seizure. The area surrounded by the red box is expanded below to show the outstretched hindlimb (red arrow). Right: A photo of a mouse that did not die suddenly. The area surrounded by the box is expanded below to show that the hindlimb is not outstretched.

      The revised text is in the Results, Section E, starting on Line 793:

      “Low choline-treated mice appeared to have died during a seizure because they were found in a specific posture in their cage, which occurs when a severe seizure leads to death: a prone posture with extended, rigid limbs (Supplemental Figure 5). Regardless of how the mice died, there was greater mortality in the low choline group than in mice that had been fed the high choline diet (log-rank (Mantel-Cox) test, chi-square 5.36, df 1, p=0.021; Supplemental Figure 5A).”
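      For readers unfamiliar with the log-rank (Mantel-Cox) comparison cited above, the following is a minimal standard-library sketch of the textbook two-group test. It is not the software used for the paper (the response does not name it); the input format (one follow-up time per mouse plus a 1/0 flag for observed death vs. censoring), the function name, and the toy data in the usage note are assumptions for illustration only.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank (Mantel-Cox) test.

    times_*: follow-up time for each animal (e.g., days).
    events_*: 1 if death was observed at that time, 0 if the animal was censored.
    Returns (chi_square, p_value) for a statistic with 1 degree of freedom.
    """
    data = [(t, e, "a") for t, e in zip(times_a, events_a)]
    data += [(t, e, "b") for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        at_risk = sum(1 for ti, _, _ in data if ti >= t)
        at_risk_a = sum(1 for ti, _, g in data if ti >= t and g == "a")
        deaths = sum(1 for ti, e, _ in data if ti == t and e)
        deaths_a = sum(1 for ti, e, g in data if ti == t and e and g == "a")
        frac_a = at_risk_a / at_risk
        observed_a += deaths_a
        expected_a += deaths * frac_a
        if at_risk > 1:
            # Hypergeometric variance of the deaths in group a at this time point.
            variance += deaths * frac_a * (1 - frac_a) * (at_risk - deaths) / (at_risk - 1)

    if variance == 0:
        raise ValueError("log-rank variance is zero; groups cannot be compared")
    chi_square = (observed_a - expected_a) ** 2 / variance
    # Upper tail of a chi-square with 1 df equals a two-sided normal tail at sqrt(chi_square).
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value
```

      With toy data in which group a dies on days 1, 2 and 3 while group b is censored on days 4, 5 and 6, the statistic is about 5.05 (p ≈ 0.025); two groups with identical survival give a chi-square of 0 and p = 1.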

      Also, why isn't intermediate choline also shown?

      We do not have the data from the animals. Records of death were not kept, regrettably.

      Perhaps labeling of male/female could also be done as part of this graph.

      We agree this would be very interesting but do not have all sex information.

      B is not very convincing, though it is understandable once one reads about posture.

      We have clarified the text and figure, as well as the legend. They are above.

      Are there additional animals that were seen to be in a specific posture?

      There are many examples, and we added them to hopefully make it more convincing.

      We also added images of the posture of WT mice that died, to show how different it is.

      Is there any relationship between seizures detected via EEG, as shown in Supplemental Figure 3, and death?

      Several mice died during a convulsive seizure, which is the type of seizure that is shown in the Supplemental Figure.

      (9) Supplemental Figure 3 seems to display an isolated case in which EEG-detected seizures correlate with increased IIEs. It is not clear whether there are additional documented cases of seizures that could be assembled into a meaningful population graph. If this data does not exist or is too much work to include in this manuscript, perhaps it can be saved for a future paper.

      We have added other cases and revised the graph. This is now Supplemental Figure 4 and is shown above in response to Reviewer #1, Recommendation for Authors Point #4.

      Frontal is misspelled.

      We checked and our copy is not showing a misspelling. However, we are very grateful to the Reviewer for catching many errors and reading the manuscript carefully.

      (10) Supplemental Figure 4 seems incomplete in that it does not include EEG data from months 4, 5, and 6 (see Fig. 7B).

      We have added data for these ages to the Supplemental Figure (currently Supplemental Figure 6) as part B. In part A, which had been the original figure, only 1.2-, 2-, and 3-month-old mice were shown because there were insufficient numbers of each sex at other ages. However, by pooling 1.2 and 2 months (Supplemental Figure 6B1), 3 and 4 months (B2), and 5 and 6 months (B3), we could analyze the effect of sex. The results are the same: we detected no sex differences.

      Author response image 9.

      IIS frequency was similar for each sex. A. IIS frequency was compared for females and males at 1.2 months (1), 2 months (2), and 3 months (3). Two-way ANOVA was used to analyze the effects of sex and diet. Female and male Tg2576 mice were not significantly different. B. Mice were pooled at 1.2 and 2 months (1), 3 and 4 months (2), and 5 and 6 months (3). Two-way ANOVA analyzed the effects of sex and diet. There were significant effects of diet for (1) and (2) but not (3). There were no effects of sex at any age. (1) There were significant effects of diet (F(2,47)=46.21, p<0.0001) but not sex (F(1,47)=0.106, p=0.746). Female and male mice fed the low choline diet or high choline diet were significantly different from female and male mice fed the intermediate diet (all p<0.05, asterisk). (2) There were significant effects of diet (F(2,32)=10.82, p=0.0003) but not sex (F(1,32)=1.05, p=0.313). Both female and male mice of the low choline group were significantly different from male mice fed the intermediate diet (both p<0.05, asterisk), but no other pairwise comparisons were significant. (3) There were no significant differences (diet, F(2,23)=1.21, p=0.317; sex, F(1,23)=0.844, p=0.368).

      The data are discussed in the Results, Section G, starting on Line 843:

      In Supplemental Figure 6B we grouped mice at 1-2 months, 3-4 months and 5-6 months so that there were sufficient females and males to compare each diet. A two-way ANOVA with diet and sex as factors showed a significant effect of diet (F(2,47)=46.21; p<0.0001) at 1-2 months of age, but not of sex (F(1,47)=0.11, p=0.758). Post-hoc comparisons showed that the low choline group had fewer IIS than the intermediate group, and the same was true for the high choline-treated mice. Thus, female mice fed the low choline diet differed from the females (p<0.0001) and males (p<0.0001) fed the intermediate diet. Male mice that had received the low choline diet differed from females (p<0.0001) and males (p<0.0001) fed the intermediate diet. Female mice fed the high choline diet differed from females (p=0.002) and males (p<0.0001) fed the intermediate diet, and males fed the high choline diet differed from females (p<0.0001) and males (p<0.0001) fed the intermediate diet.

      For the 3-4 months-old mice there was also a significant effect of diet (F(2,32)=10.82, p=0.0003) but not sex (F(1,32)=1.05, p=0.313). Post-hoc tests showed that low choline females were different from males fed the intermediate diet (p=0.007), and low choline males were also significantly different from males that had received the intermediate diet (p=0.006). There were no significant effects of diet (F(2,23)=1.21, p=0.317) or sex (F(1,23)=0.84, p=0.368) at 5-6 months of age.

    1. Author Response

      Reviewer #1 (Public Review):

      Weaknesses:

      Gene expression level as a confounding factor was not well controlled throughout the study. Higher gene expression often makes genes less dispensable after gene duplication. Gene expression level is also a major determinant of evolutionary rates (reviewed in http://www.ncbi.nlm.nih.gov/pubmed/26055156). Some proposed theories explain why gene expression level can serve as a proxy for gene importance (http://www.ncbi.nlm.nih.gov/pubmed/20884723, http://www.ncbi.nlm.nih.gov/pubmed/20485561). In that sense, many genomic/epigenomic features (such as replication timing and repressed transcriptional regulation) that the authors assumed to be "neutral" or intrinsic (or, more accurately, independent of gene dispensability) cannot be easily distinguished from the effect of gene dispensability.

      We thank the reviewer for this important comment. We totally agree that transcriptomic and epigenomic features cannot be easily distinguished from gene dispensability, and we do not think that these features of the elusive genes can be explained solely by intrinsic properties of the genomes. Our motivation for investigating the expression profiles of the elusive genes is to understand how they lost their functional indispensability (original manuscript L285-286 in Results). We also discussed the possibility that the sequence composition and genomic location of elusive genes may be associated with epigenetic features that repress expression, which may result in a decrease of functional constraints (original manuscript L470-474 in Discussion). Nevertheless, we think that the original manuscript may have contained misleading wording, and thus we have edited it to better convey our view that gene expression and epigenomic features are related to gene function.

      (P.2, Introduction) This evolutionary fate of a gene can also be affected by factors independent of gene dispensability, including the mutability of genomic positions, but such features have not been examined well.

      (P6, Introduction) These data helped us understand how intrinsic genomic features may affect gene fate, leading to gene loss by decreasing the expression level and eventually relaxing the functional importance of ‘elusive’ genes.

      (P33, Discussion) Another factor is the spatiotemporal suppression of gene expression via epigenetic constraints. Previous studies showed that lowly expressed genes tend to be more functionally dispensable (Cherry, 2010; Gout et al., 2010), and the same may hold for the elusive genes.

      Additionally, in response to the advice from Reviewers 1 and 2 [Rev1minor7 and Rev2-Major4], we have added a new section, Elusive gene orthologs in the chicken microchromosomes, in which we describe the relationship between the elusive genes and chicken microchromosomes. In this section, we also discuss the relationship between the genomic features of the elusive genes and their transcriptomic and epigenomic characteristics. In the chicken genome, elusive genes did not show reduced pleiotropy of gene expression or the epigenetic features associated with such a reduction, consistent with their moderate nucleotide substitution rates. This also suggests that the relaxation of ‘elusiveness’ is associated with an increase in functional indispensability.

      (P27, Elusive gene orthologs in the chicken microchromosomes in Results) Our analyses indicate that the genomic features of the elusive genes, such as high GC content and high nucleotide substitution rates, do not always correlate with a reduction in pleiotropy of gene expression that potentially leads to an increase in functional dispensability, although these features have been well conserved across vertebrates. In addition, the avian orthologs of the elusive genes did not show higher KA and KS values than those of the non-elusive genes (Figure 3; Figure 3–figure supplement 1), likely consistent with their similar expression levels (Figure 5–figure supplement 1) (Cherry, 2010; Zhang and Yang, 2015). In the chicken genome, the sequence features of the elusive genes themselves might have been relaxed during evolution.

      Ks was used by the authors to indicate mutation rates. However, synonymous mutations substantially affect gene expression levels (https://pubmed.ncbi.nlm.nih.gov/25768907/, https://pubmed.ncbi.nlm.nih.gov/35676473/). Thus, synonymous mutations cannot be simply assumed as neutral ones and may not be suitable for estimating local mutation rates. If introns can be aligned, they are better sequences for estimating the mutability of a genomic region.

      We thank the reviewer for this helpful suggestion. In response, we have computed the differences in intron sequences between the human and chimpanzee genomes and compared them between the elusive and non-elusive genes. As expected, we found larger sequence differences in introns for the elusive genes than for the non-elusive genes. Figure 2c of the revised manuscript now shows the distribution of KI, the sequence differences in introns between the human and chimpanzee genomes, for the elusive and non-elusive genes. Additionally, we have added the corresponding text to Results and the procedure to Methods, as shown below.

      (P11, Identification of human ‘elusive’ genes in Results) In addition, we computed nucleotide substitution rates for introns (KI) between human and chimpanzee (Pan troglodytes) orthologs and compared them between the elusive and non-elusive genes.

      (P11, Identification of human ‘elusive’ genes in Results) Our analysis further revealed larger KS and KI values for the elusive genes than for the non-elusive genes (Figure 2b, c; Figure 2–figure supplement 1). Importantly, the higher rates of synonymous and intronic nucleotide substitutions, which may not change amino acid residues, indicate that the elusive genes are also susceptible to genomic characteristics independent of selective constraints on gene functions.

      (P39, Methods) To compute nucleotide sequence differences of the individual introns, we extracted 473 elusive and 4,626 non-elusive genes that harbored introns aligned with the chimpanzee genome assembly. The nucleotide differences were calculated via the whole genome alignments of hg38 and panTro6 retrieved from the UCSC genome browser.
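      Per-intron divergence of the kind described above can be sketched for a single pair of aligned human and chimpanzee intron sequences. The response states only that nucleotide differences were computed from the hg38/panTro6 whole-genome alignments; skipping gap and ambiguous positions, and reporting a Jukes-Cantor (JC69) corrected distance alongside the raw proportion, are assumptions made here for illustration, and the function name is hypothetical.

```python
import math


def intron_divergence(human, chimp):
    """Return (p_distance, jc69_distance) for two aligned intron sequences.

    Columns where either sequence has a gap or a non-ACGT base are skipped.
    """
    if len(human) != len(chimp):
        raise ValueError("aligned sequences must have equal length")
    compared = differences = 0
    for a, b in zip(human.upper(), chimp.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # gap or ambiguous base: not a comparable site
        compared += 1
        differences += int(a != b)
    if compared == 0:
        raise ValueError("no comparable sites")
    p = differences / compared
    if p >= 0.75:
        raise ValueError("too divergent for the Jukes-Cantor correction")
    # JC69: d = -3/4 * ln(1 - 4p/3), correcting for multiple hits at a site.
    d = -0.75 * math.log(1 - 4 * p / 3)
    return p, d
```

      For human-chimpanzee introns, p is small and the correction changes it only slightly (one difference in ten sites gives p = 0.1 and d ≈ 0.107), so either value ranks elusive and non-elusive genes in much the same way.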

      The term "elusive gene" is not necessarily intuitive to readers.

      We previously published a paper reporting the group of genes that we refer to as ‘elusive genes’, lost in mammals and birds independently but retained by reptiles, in the gecko genome assembly (Hara et al., 2018, BMC Biology). We initially gave them a more intuitive name (‘loss-prone genes’) but changed it because one of our peer reviewers did not agree with this name. We subsequently used this term in another paper (Hara et al., 2018, Nat. Ecol. Evol.). In addition, some other groups have used the word ‘elusive’ with a similar intention to ours (Prokop et al., 2014, PLOS ONE, doi: 10.1371/journal.pone.0092751; Ribas et al., 2011, BMC Genomics, doi: 10.1186/1471-2164-12-240). We would appreciate the reviewer’s understanding of this naming, which keeps our research on gene loss consistent. In the revised manuscript, we have added sentences to provide a more intuitive guide to ‘elusive genes’:

      (P6, Introduction) We previously referred to the nature of genes prone to loss as ‘elusive’ (Hara et al., 2018a, 2018b). In the present study, we define the elusive genes as those that are retained by modern humans but have been lost independently in multiple mammalian lineages. For comparison with the elusive genes, we retrieved the genes that were retained by almost all of the mammalian species examined and defined them as ‘non-elusive’, representing genes that persist in the genomes.

      Reviewer #3 (Public Review):

      Overall, the study is descriptive and adds incremental evidence to an existing body of extensive gene loss literature. The topic is specialised and will be of interest to a niche audience. The text is highly redundant, repeating the same false positive issue in the introduction, methods, and discussion sections, while no clear conclusion or interpretation of the main findings is presented.

      Major comments

      While some of the false discovery rate issues of gene loss detection were addressed in the presented pipeline, the authors fail to test one of the most severe cases of mis-annotating gene loss events: frameshift mutations, which cause gene annotation pipelines to fail to report these genes in the first place. Running a blastx or diamond blastx search of their elusive and non-elusive gene sets against all other genomes should further illuminate the robustness of their gene loss detection approach.

      For the revised manuscript, we have refined the elusive gene set as the reviewer suggested. In the genome assemblies, we have searched for orthologs of the elusive genes in the species in which they were missing. The search was conducted by querying amino acid sequences of the elusive genes with tblastn as well as with MMseqs2, which performed better than tblastn in sensitivity and computational speed. In addition, in response to another comment by Reviewer 3, we have searched for orthologs by referring to existing ortholog annotations. We used the ortholog annotations implemented in RefSeq instead of those from the TOGA pipeline; both employ synteny conservation. We reconciled the identified orthologs with our gene loss criterion (absence from all the species used in a particular taxon) and excluded 268 genes from the original elusive gene set. These include genes missing from the gene annotations used in the original manuscript but present in the latest ones, as well as genes falsely inferred as missing due to incorrect inference of gene trees. Finally, the refined set of 813 elusive genes was compared with the non-elusive genes. Importantly, these comparisons retained the significantly different trends in the particular genomic, transcriptomic, and epigenomic features between them, except in very few cases (Table R1 included below). This indicates that both the initial and revised sets of elusive genes reflect the nature of ‘elusiveness,’ though the initial set contained some noise. We have updated the numbers of elusive genes in the corresponding parts of the manuscript, including figures and tables. Additionally, we have added the validation procedures to Methods.

      Table R1. Differences in statistical significance across the elusive gene sets. *Features not listed here showed significantly different trends between the elusive and non-elusive genes for all of the elusive gene sets and are therefore not included in this table.

      (P38 in Methods) The gene loss events inferred by molecular phylogeny were further assessed using synteny-based ortholog annotations implemented in RefSeq, as well as a homolog search in the genome assemblies (Table S2) with TBLASTN v2.11.0+ (Altschul et al., 1997) and MMseqs2 (Steinegger and Söding, 2017), referring to the latest RefSeq gene annotations (last accessed on 2 Dec 2022). This procedure identified 813 elusive genes that harbored three or fewer duplicates. Similarly, we extracted 8,050 human genes whose orthologs were found in all the mammalian species examined and defined them as non-elusive genes.

      The reviewer also suggested that we investigate genes falsely reported as missing because of frameshift mutations (we assume the reviewer had in mind genome assemblies that erroneously contain frameshift mutations). This requires searching for orthologs in the sequencing reads themselves, because such frameshifts are sometimes caused by indels from erroneous base calling. We selected five elusive genes and searched the sequencing reads of the species in which they are missing for fragments of their orthologs. We retrieved the sequencing reads underlying the genome assemblies from NCBI SRA and performed a sequence similarity search with DIAMOND against the amino acid sequences of the elusive genes, but found no frameshift that could explain the mis-annotation of the elusive genes.

      Along this line, we noticed that when annotation files were pooled together via CD-HIT clustering, a 100% identity threshold was chosen (Methods). Since some of the pooled annotations were drawn from lower-quality assemblies, which are more likely to yield mismatches between annotations, enforcing a 100% identity threshold will artificially remove genes because of this strict constraint. It will be paramount for this study to test the robustness of its findings with 90% and 95% identity thresholds.

      CD-HIT clustering at 100% sequence identity groups only identical (and sometimes truncated) sequences, and within each cluster, all sequences other than the representative are discarded. Sequences that are not identical to any other sequence are therefore retained. If the identity threshold is lowered, both identical and highly similar sequences are clustered together, and more sequences are discarded. Therefore, our approach of clustering at 100% identity may minimize false-positive gene loss.
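      To illustrate why lowering the identity threshold discards more sequences rather than fewer, the following is a minimal toy sketch of greedy incremental clustering in the spirit of CD-HIT. It is not CD-HIT itself: it assumes ungapped alignment, identity measured over the shorter sequence, and no word filter or coverage options; the function names and sequences are hypothetical.

```python
def ungapped_identity(short: str, long: str) -> float:
    """Best fraction of identical residues when the shorter sequence is slid along
    the longer one without gaps (a simplification of CD-HIT's alignment step)."""
    best = 0
    for offset in range(len(long) - len(short) + 1):
        window = long[offset:offset + len(short)]
        best = max(best, sum(a == b for a, b in zip(short, window)))
    return best / len(short)


def greedy_cluster(seqs: list[str], threshold: float) -> list[str]:
    """Longest sequences become representatives first; each later sequence is
    discarded if it matches an existing representative at >= threshold."""
    representatives: list[str] = []
    for seq in sorted(seqs, key=len, reverse=True):
        if not any(ungapped_identity(seq, rep) >= threshold for rep in representatives):
            representatives.append(seq)
    return representatives


# Hypothetical fragments: a duplicate, a truncation, and a one-residue variant.
seqs = ["MKTAYIAKQR", "MKTAYIAKQR", "KTAYIAKQ", "MKTAYIAKQW", "GGGGSSSSAA"]
print(len(greedy_cluster(seqs, 1.0)))  # 3: duplicate and truncation collapsed
print(len(greedy_cluster(seqs, 0.9)))  # 2: the one-residue variant is collapsed too
```

      In this toy model, only the 100% threshold keeps the one-residue variant as a separate sequence, which is the behavior the reply relies on.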

      While some statistical tests were applied (although we do recommend consulting a professional statistician, since some seemingly identical distributions show significantly low p-values), the authors do not discuss the fact that their elusive gene set comprises ~5% of all human genes (assuming 21,000 genes), while their non-elusive set represents ~40% of all genes. In other words, the authors compare their sequence and genomic features against the genomic background rather than against a biological signal (non-elusiveness). An analysis in which 1,081 genes (the same number as the elusive set) are randomly sampled from the 21,000-gene pool and compared against the elusive and non-elusive distributions for all presented results would reveal whether the non-elusive set follows a background distribution (noise) or not.

      Our study aims to elucidate the characteristics that differentiate the fates of genes: retention or loss. To achieve this, we framed the characterization as a comparison between the elusive and non-elusive genes. This comparison contrasts clearly different phylogenetic signals for gene loss, allowing us to extract the features associated with a loss-prone nature. The randomly sampled set suggested by the reviewer would largely consist of the remaining genes, which were classified as neither elusive nor non-elusive. However, these remaining genes may include a considerable number of genes with distinctive phylogenetic signatures rather than intermediates between the elusive and non-elusive genes: for example, genes with multiple loss events in taxa more restricted than our criterion, or genes with frequent duplication. Therefore, we think that comparing the elusive genes with a randomly sampled set would not serve our objective, which is to compare clearly different phylogenetic signals.

      We also wondered whether the authors considered testing the links between recombination rate / LD and the genomic locations of their elusive genes (again compared against randomly sampled genes)?

      We retrieved fine-scale male and female recombination rate data from https://www.decode.com/addendum/ (Suppl. Data of Kong, A et al., Nature, 467:1099–1103, 2010) and compared them between the gene regions of the elusive and non-elusive genes. Neither comparison shows a significant difference (males: average 0.829 vs. 0.900 recombinations/kb for the elusive and non-elusive genes, respectively, p = 0.898; females: average 0.836 vs. 0.846 recombinations/kb, respectively, p = 0.256).
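      A per-gene value of the kind compared above can be obtained by averaging the recombination map over each gene region, weighting each map bin by its overlap with the gene. The sketch below assumes the map is available as non-overlapping (start, end, rate) bins; this layout, the function name, and the numbers are hypothetical, and the reply does not state which statistical test produced the p-values.

```python
def mean_rate_over_region(rate_map, start, end):
    """Length-weighted mean recombination rate across the interval [start, end).

    rate_map: list of (bin_start, bin_end, rate) tuples for non-overlapping bins
    on one chromosome. Parts of the interval not covered by any bin are ignored.
    """
    covered = 0
    weighted_sum = 0.0
    for bin_start, bin_end, rate in rate_map:
        overlap = min(end, bin_end) - max(start, bin_start)
        if overlap > 0:
            covered += overlap
            weighted_sum += overlap * rate
    return weighted_sum / covered if covered else float("nan")


# Hypothetical map with two 10-kb bins; a gene spanning half of each bin.
rate_map = [(0, 10_000, 0.5), (10_000, 20_000, 1.5)]
print(mean_rate_over_region(rate_map, 5_000, 15_000))  # 1.0
```

      The per-gene means of the two gene sets could then be compared with a two-sample test of the reader's choice.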

      Given the evidence presented in Figure 6b, we do not agree with the statement (l.334-336): "These observations suggest that the elusive genes are unlikely to be regulated by distant regulatory elements". Here, a data population of ~1k genes is compared against a data population of ~8k genes and the presented difference between distributions could be a sample size artefact. We strongly recommend retesting this result with the ~1k randomly sampled genes from the total ~21,000 gene pool and then compare the distributions.

      An analogous random-sampling analysis should be performed for Figures 6a and 6d.

      As described above, our study does not intend to extract signals from the background. To make the objective of the comparison clear, we have revised the corresponding sentence as follows.

      (P22, Transcriptomic natures of elusive genes in Results) These observations suggest that the elusive genes are unlikely to be regulated by distant regulatory elements compared with the non-elusive genes (Figure 6b).

      We didn't see a clear pattern in Figure 7. Please quantify enrichments with statistical tests. Even if there are enriched regions, why did the authors choose a Shannon entropy cutoff of <1 (low) and >1 (high)? What was the overall range of entropy values? If the maximum entropy value was 10 or 100 or even more, then denoting <1 as low and >1 as high seems rather biased.

      To use Figure 7 in a new section of Results, we have added an ideogram showing the distribution of the genes whose chicken orthologs are located on microchromosomes. In response to the comment by Reviewer 2, we performed statistical tests and found that the elusive genes had chicken orthologs on microchromosomes significantly more often than the non-elusive genes. Furthermore, the preference of the elusive genes for gene-rich regions was already statistically supported (Figure 2f).
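      The reply does not say which test was used for the microchromosome comparison. One common choice for comparing two gene sets on a yes/no feature (here, whether the chicken ortholog lies on a microchromosome) is a one-sided Fisher's exact test on the resulting 2×2 table. The stdlib sketch below is offered only as an illustration; the function name and counts are hypothetical.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test for the 2x2 table
        [[a, b],   # e.g. elusive genes: ortholog on microchromosome / elsewhere
         [c, d]]   # e.g. non-elusive genes: ortholog on microchromosome / elsewhere
    Returns P(X >= a) under the hypergeometric null with fixed margins."""
    row1 = a + b          # size of group 1
    col1 = a + c          # total number of genes with the feature
    n = a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return tail / comb(n, col1)


print(fisher_exact_greater(3, 0, 0, 3))  # 0.05 (= 1/20)
```

      For real tables, `scipy.stats.fisher_exact(table, alternative="greater")` computes the same one-sided test.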

      As shown in Figure 5, Shannon’s H' ranged from zero to approximately 4 (maximum 3.97) for the GTEx dataset and to approximately 5 (maximum 5.11) for the Descartes gene expression dataset. Although the threshold of H' = 1 was set arbitrarily, we think that it reasonably separates genes with high pleiotropy from those with low pleiotropy.
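      For reference, if p_i is the fraction of a gene's expression in tissue (or cell type) i out of N, then H' = -Σ p_i ln p_i. H' is 0 when expression is confined to one tissue and reaches its maximum, ln N, for uniform expression. Assuming the natural logarithm (the reply does not state the base), H' = 1 corresponds to an effective number of e^1 ≈ 2.7 equally expressing tissues, which may help readers interpret the cutoff relative to the observed maxima. A minimal sketch, with a hypothetical function name:

```python
import math


def shannon_h(expression: list[float]) -> float:
    """Shannon's H' (natural log) of one gene's expression across tissues.
    Tissues with zero expression contribute nothing (0 * log 0 is taken as 0)."""
    total = sum(expression)
    h = 0.0
    for x in expression:
        if x > 0:
            p = x / total
            h -= p * math.log(p)
    return h


print(shannon_h([10.0, 0.0, 0.0]))           # 0.0: expressed in a single tissue
print(shannon_h([1.0] * 4), math.log(4))     # uniform over 4 tissues: H' = ln 4
print(math.exp(shannon_h([5.0, 3.0, 2.0])))  # effective number of tissues
```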

    1. Author Response

      Reviewer #1 (Public Review):

      In this manuscript, Wei & Robles et al seek to estimate the heritability contribution of Neanderthal Informative Markers (NIM) relative to SNPs that arose in modern humans (MH). This is a question that has received a fair amount of attention in recent studies, but persistent statistical limitations have made some prior results difficult to interpret. Of particular concern is the possibility that heritability (h^2) attributed to Neanderthal markers might be tagging linked variants that arose in modern humans, resulting in overestimation of h^2 due to Neanderthal variants. Neanderthal variants also tend to be rare, and estimating the contribution of rare alleles to h^2 is challenging. In some previous studies, rare alleles have been excluded from h^2 estimates.

      Wei & Robles et al develop and assess a method that estimates both total heritability and per-SNP heritability of NIMs, allowing them to test whether NIM contributions to variation in human traits are similar or substantially different than modern human SNPs. They find an overall depletion of heritability across the traits that they studied, and found no traits with enrichment of heritability due to NIMs. They also developed a 'fine-mapping' procedure that aims to find potential causal alleles and report several potentially interesting associations with putatively functional variants.

      Strengths of this study include rigorous assessment of the statistical methods employed with simulations and careful design of the statistical approaches to overcome previous limitations due to LD and frequency differences between MH and NIM variants. I found the manuscript interesting and I think it makes a solid contribution to the literature that addresses limitations of some earlier studies.

      My main questions for the authors concern potential limitations of their simulation approach. In particular, they describe varying genetic architectures corresponding to the enrichment of effects among rare alleles or common alleles. I agree with the authors that it is important to assess the impact of (unknown) architecture on the inference, but the models employed here are ad hoc and unlikely to correspond to any mechanistic evolutionary model. It is unclear to me whether the contributions of rare and common alleles (and how these correspond with levels of LD) in real data will be close enough to these simulated schemes to ensure good performance of the inference.

      In particular, the common allele model employed makes 90% of effect variants have frequencies above 5% -- I am not aware of any evolutionary model that would result in this outcome, which would suggest that more recent mutations are depleted for effects on traits (of course, it is true that common alleles explain much more h^2 under neutral models than rare alleles, but this is driven largely by the effect of frequency on h^2, not the proportion of alleles that are effect alleles). Likewise, the rare allele model has the opposite pattern, with 90% of effect alleles having frequencies under 5%. Since most alleles have frequencies under 5% anyway (~58% of MH SNPs and ~73% of NIM SNPs) this only modestly boosts the prevalence of low frequency effect alleles relative to their proportion. Some selection models suggest that rare alleles should have much bigger effects and a substantially higher likelihood of being effect alleles than common alleles. I'm not sure this situation is well-captured by the simulations performed. With LD and MAF annotations being applied in relatively wide quintile bins, do the authors think their inference procedure will do a good job of capturing such rare allele effects? This seems particularly important to me in the context of this paper, since the claim is that Neanderthal alleles are depleted for overall h^2, but Neanderthal alleles are also disproportionately rare, meaning they could suffer a bigger penalty. This concern could be easily addressed by including some simulations with additional architectures to those considered in the manuscript.

      We thank the reviewers for their thoughtful comments regarding rare alleles, and we agree that our RARE simulations only moderately boosted the enrichment of rare alleles among causal variants. To address this, we added new simulations, ULTRA RARE, in which SNPs with MAF < 0.01 constitute 90% of the causal variants. As in our previous simulations, we used 100,000 and 10,000 causal variants to mimic highly and moderately polygenic phenotypes, and heritabilities of 0.5 and 0.2 to mimic highly and moderately heritable phenotypes. We again ran three replicate simulations for each combination and partitioned the heritability with the Ancestry-only, Ancestry+MAF, Ancestry+LD, and Ancestry+MAF+LD annotations. Our Ancestry+MAF+LD annotation remains calibrated in this setting (see Figure below). We believe this experiment strengthens our paper and have added it as Fig S2.
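      The ULTRA RARE design can be sketched as choosing causal SNPs so that 90% fall below a MAF cutoff, then drawing effects whose variances sum to the target heritability. This is a minimal illustration, not the authors' simulation code: the function names are hypothetical, effects are assumed Gaussian on the standardized-genotype scale, and genotype simulation and heritability partitioning are omitted.

```python
import random


def sample_causal(mafs, n_causal, rare_cutoff=0.01, rare_fraction=0.9, seed=0):
    """Pick causal SNP indices so that `rare_fraction` of them have MAF < rare_cutoff."""
    rng = random.Random(seed)
    rare = [i for i, m in enumerate(mafs) if m < rare_cutoff]
    common = [i for i, m in enumerate(mafs) if m >= rare_cutoff]
    n_rare = round(rare_fraction * n_causal)
    # random.sample raises ValueError if either pool is smaller than requested.
    return rng.sample(rare, n_rare) + rng.sample(common, n_causal - n_rare)


def draw_effects(n_causal, h2, seed=0):
    """Per-SNP effects on the standardized-genotype scale; their expected
    variances sum to h2, so the expected genetic variance equals h2."""
    rng = random.Random(seed)
    sd = (h2 / n_causal) ** 0.5
    return [rng.gauss(0.0, sd) for _ in range(n_causal)]


# Toy panel: 500 SNPs with MAF 0.005 and 500 SNPs with MAF 0.2.
mafs = [0.005] * 500 + [0.2] * 500
causal = sample_causal(mafs, n_causal=100)
effects = draw_effects(len(causal), h2=0.5)
print(sum(mafs[i] < 0.01 for i in causal))  # 90
```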

      While we agree that these architectures are ad hoc and unlikely to correspond to realistic evolutionary scenarios, we chose them to span the range of possible architectures, so that the skews towards common or rare alleles that we explored are extreme. The finding that our estimates are calibrated across this range leads us to conclude that our inferences should be robust.

      More broadly, we concur with the reviewer that our results (as well as those of others in the field) may need to be revisited as our view of the genetic architecture of complex traits evolves. The methods that we propose in this paper are general enough to explore such architectures in the future by choosing a sufficiently large set of annotations that match the characteristics of NIMs and MH SNPs. A practical limitation of this strategy is that a large number of annotations can leave some annotations with only a small number of SNPs, which would, in turn, reduce the precision of our estimates. This limitation is particularly relevant because there are far fewer NIMs than MH SNPs (around 250K vs. around 8M).
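      The trade-off described above can be made concrete with a minimal sketch of annotation construction: SNPs are split into equal-count MAF and LD bins and crossed with ancestry, and the count per annotation shows how thinly the smaller NIM class is spread. The binning scheme here (quintiles computed over all SNPs jointly, LD summarized by one score per SNP) and the function names are assumptions for illustration; the paper's exact definitions may differ.

```python
from collections import Counter


def quantile_bins(values, n_bins=5):
    """Assign each value to one of n_bins equal-count bins (0..n_bins-1) by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


def annotation_counts(ancestry, maf, ld_score, n_bins=5):
    """Number of SNPs in each (ancestry, MAF bin, LD bin) annotation."""
    maf_bin = quantile_bins(maf, n_bins)
    ld_bin = quantile_bins(ld_score, n_bins)
    return Counter(zip(ancestry, maf_bin, ld_bin))


# Toy example: 8 MH SNPs and 2 NIMs. With 2 x 5 x 5 = 50 possible annotations,
# most NIM annotations are empty.
counts = annotation_counts(["MH"] * 8 + ["NIM"] * 2, list(range(10)), list(range(10)))
print(counts)
```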

      Reviewer #2 (Public Review):

      The goal of the work described in this paper is to comprehensively describe the contribution of Neanderthal-informative mutations (NIMs) to complex traits in modern human populations. There are some known challenges in studying these variants, namely that they are often uncommon, and have unusually long haplotype structures. To overcome these, the authors customized a genotyping array to specifically assay putative Neanderthal haplotypes, and used a recent method of estimating heritability that can explicitly account for differences in MAF and LD.

      This study is well thought-out, and the ability to specifically target the genotyping array to the variants in question and then use that information to properly control for population structure is a massive benefit. The methodology also allowed them to include rarer alleles that were generally excluded from previous studies. The simulations are thorough and convincingly show the importance of accounting for both MAF and LD in addition to ancestry. The fine-mapping done to disentangle effects between actual Neanderthal variants and Modern human ones on the same haplotype also seems reasonable. They also strike a good balance between highlighting potentially interesting examples of Neanderthal variants having an effect on phenotype without overinterpreting association-based findings.

      The main weakness of the paper is in its description of the work, not the work itself. The paper currently places a lot of emphasis on comparing these results to prior studies, particularly on its disagreement with McArthur, et al. (2021), a study on introgressed variant heritability that was also done primarily in UK Biobank. While they do show that the method used in that study (LDSR) does not account for MAF and LD as effectively as this analysis, this work does not support the conclusion that this is a major problem with previous heritability studies. McArthur et al. in fact largely replicate these results that Neanderthal variants (and more generally regions with Neanderthal variants) are depleted of heritability, and agree with the interpretation that this is likely due to selection against Neanderthal alleles. I actually find this a reassuring point, given the differences between the variant sets and methods used by the two studies, but it isn't mentioned in the text. Where the two studies differ is in specifics, mainly which loci have some association with human phenotypes; McArthur et al. also identified a couple groups of traits that were exceptions to the general rule of depleted heritability. While this work shows that not accounting for MAF and LD can lead to underestimating NIM heritability, I don't follow the logic behind the claim that this could lead to a false positive in heritability enrichment (a false negative would be more likely, surely?). There are also more differences between this and previous heritability studies than just the method used to estimate heritability, and the comparisons done here do not sufficiently account for these. A more detailed discussion to reconcile how, despite its weaknesses, LDSR picks up similar broad patterns while disagreeing in specifics is merited.

      We agree with the reviewer that our results are generally concordant with those of McArthur et al. 2021 and that this concordance is reassuring given the differences between our studies. The differences between the studies, wherein McArthur et al. 2021 identify a few traits with elevated heritability while we do not, could arise for reasons beyond methodology, such as differences in the sets of variants analyzed. We have partially explored this possibility in the revised manuscript by analyzing the set of introgressed variants identified by the Sprime method (which was used in McArthur et al. 2021) with our method: we continue to observe a pattern of depletion with no evidence for enrichment. We hypothesize that the reason LDSR recovers similar overall patterns despite its limitations lies in the nature of selection on introgressed alleles (which, in turn, influences the dependence of effect size on allele frequency and LD). Investigating this hypothesis will require a detailed understanding of how the LDSR results depend on parameters such as the MAF threshold on the regression SNPs and the LD reference SNPs, and the choice of the LD reference panel.

      Not accounting for MAF and LD can underestimate NIM heritability, but can either underestimate or overestimate heritability at MH SNPs. Hence, tests that compare per-SNP heritability at NIMs with that at MH SNPs can lead to false positives in the direction of both enrichment and depletion.
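      A small worked example illustrates the point: the test statistic is a ratio of per-SNP heritabilities, so a bias in the MH denominator can move the ratio in either direction. The sketch uses the approximate SNP counts given in the reply to Reviewer #1 (about 250K NIMs and 8M MH SNPs); the heritability values are hypothetical, chosen so that the true ratio is 1, and the function name is illustrative.

```python
def per_snp_enrichment(h2_nim, m_nim, h2_mh, m_mh):
    """Per-SNP heritability at NIMs divided by per-SNP heritability at MH SNPs.
    A value of 1 means no difference; > 1 suggests enrichment, < 1 depletion."""
    return (h2_nim / m_nim) / (h2_mh / m_mh)


M_NIM, M_MH = 250_000, 8_000_000
true_h2_nim, true_h2_mh = 0.00625, 0.2  # hypothetical: equal per-SNP heritability

print(per_snp_enrichment(true_h2_nim, M_NIM, true_h2_mh, M_MH))      # ~1.0
print(per_snp_enrichment(true_h2_nim, M_NIM, true_h2_mh / 2, M_MH))  # ~2.0: spurious enrichment
print(per_snp_enrichment(true_h2_nim, M_NIM, true_h2_mh * 2, M_MH))  # ~0.5: spurious depletion
```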

      We have now written in the Discussion: “In spite of these differences in methods and NIMs analyzed, our observation of an overall pattern of depletion in the heritability of introgressed alleles is consistent with the findings of McArthur et al. The robustness of this pattern might provide insights into the nature of selection against introgressed alleles.”

      In general this work agrees with the growing consensus in the field that introgressed Neanderthal variants were selected against, such that those that still remain in human populations do not generally have large effects on phenotypes. There are exceptions to this, but for the most part observed phenotypic associations depend on the exact set of variants being considered, and, like those highlighted in this study, still lack more concrete validation. While this paper does not make a significant advance in this general understanding of introgressed regions in modern populations, it does increase our knowledge in how best to study them, and makes a good attempt at addressing issues that are often just mentioned as caveats in other studies. It includes a nice quantification of how important these variables are in interpreting heritability estimates, and will be useful for heritability studies going forward.

    1. Author Response:

      Reviewer #1:

      The dependence of cell volume growth rate on cell size and cell cycle is a long-standing fundamental question that has traditionally been addressed by using unicellular model organisms with simple geometry, for which rough volume estimates can be obtained from bright field images. While it soon became apparent that the volume growth rate depends on cell volume, the experimental error associated with such measurements made it difficult to determine the exact dependencies. This challenge is even more significant for animal cells, whose complex and dynamic geometry makes accurate volume measurements extremely difficult. Other measures of cell size, including mass or fluorescent reporters for protein content, partially bypassed this problem. However, it is becoming increasingly clear that cell mass and volume are not strictly coupled, making accurate volume measurements essential. In their previous work, Cadart and colleagues established a 'fluorescence exclusion method', which allows accurate volume measurements of cells with complex geometry. In the present manuscript, Cadart et al. now take the next step and measure the growth trajectories of 1700 HeLa cell cycles with further improved accuracy, providing new insights into animal cell growth.

      They convincingly demonstrate that throughout large parts of the cell cycle, individual cells exhibit exponential growth, with the volume-normalized specific growth rate moderately increasing after G1-phase. At the very early stages of the cell cycle, cells exhibit a more complex growth behavior. The authors then go on and analyze the growth rate fluctuations of individual cells, identifying a decrease of the variance of the specific growth rate with cell volume and observed time scale. The authors conclude that the observed growth fluctuations are consistent with additive noise of the absolute growth rate.

      The experiments and analysis presented by Cadart et al. are carefully and well executed, and the insights provided (as well as the method established) are an important contribution to our understanding of cell growth. My major concern is that the observed fluctuation pattern seems largely consistent with what would be expected if the fluctuations stem from experimental measurement noise. This fact is appropriately acknowledged, and the authors aim to address this issue by analyzing background noise. However, further controls may be necessary to unambiguously attribute the measured noise to biological fluctuations, rather than experimental error.

      We thank the reviewer for their positive feedback and for the appreciation of our work. We performed a series of experimental controls to address the main issue regarding the measured fluctuation pattern, which indicate that it should be of biological origin.

      1.) To address whether the observed fluctuations could be due to experimental error, the authors analyze the fluctuations recorded in a cell-sized area of the background, and find that the background fluctuations are small compared to the fluctuations of the volume measurements. I think this is a very important control that supports the interpretation of the authors. However, I am not convinced that the actual measurement error is necessarily of the same amplitude as the fluctuations of the background. The background control will control for example for variations of light intensity and fluctuations of the fluorophore intensity. But what about errors in the cell segmentation? Or movement of the cells in 3D, which could be relevant because the collected light might be dependent on the distance from the surface? Is cell autofluorescence relevant at all? I am aware that accurately estimating the experimental error is exceptionally difficult, and I am also not entirely sure what would be the perfect control (if it even exists). Nevertheless, I think more potential sources of error should be addressed before the measured noise can be confidently attributed to biological sources. Maybe the authors could measure objects with constant volume over time, for example vesicles? As long as the segmented area contains the complete cell, the measured volume should not change if the area is increased. Is this the case?

      We are grateful to the reviewer for all these useful suggestions. We performed all these controls on the sources of noise, and we discuss them in the revised manuscript.
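      The reviewer's last suggestion rests on the principle of fluorescence exclusion: in a chamber of known height filled with fluorescent dye, a cell displaces dye, so the intensity drop at each pixel is proportional to the cell's local height. The minimal sketch below assumes a uniform background intensity, a known intensity at zero chamber height, and no optical blur, all simplifications relative to the calibrated published method; the function name and numbers are hypothetical. It shows why enlarging an ROI that already contains the whole cell should leave the volume unchanged: pixels outside the cell contribute zero height (in real images, only zero-mean noise).

```python
def fxm_volume(image, roi, i_background, i_zero, chamber_height_um, pixel_area_um2):
    """Volume (um^3) summed over ROI pixels. The local height at a pixel is the
    chamber height scaled by the fraction of dye displaced:
    h * (I_bg - I) / (I_bg - I_0)."""
    total = 0.0
    for r, c in roi:
        fraction_displaced = (i_background - image[r][c]) / (i_background - i_zero)
        total += chamber_height_um * fraction_displaced * pixel_area_um2
    return total


# Toy 5x5 image: background intensity 100, a 3x3 "cell" at intensity 50,
# i.e. the cell fills half of a 10 um chamber at those pixels.
image = [[100.0] * 5 for _ in range(5)]
for r in range(1, 4):
    for c in range(1, 4):
        image[r][c] = 50.0

tight_roi = [(r, c) for r in range(1, 4) for c in range(1, 4)]
wide_roi = [(r, c) for r in range(5) for c in range(5)]
args = dict(i_background=100.0, i_zero=0.0, chamber_height_um=10.0, pixel_area_um2=1.0)
print(fxm_volume(image, tight_roi, **args))  # 45.0
print(fxm_volume(image, wide_roi, **args))   # 45.0: background pixels add nothing
```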

      2.) I am particularly puzzled by the fact that even at the timescale of the frame rate, fluctuations seem not to be correlated between 2 consecutive time points (Fig. 5-S2b). This seems plausible for (some) sources of experimental error. Maybe an experiment with faster time resolution would reveal the timescale over which the fluctuations persist, which could then give us a hint about their source?

      We performed this analysis, finding an autocorrelation time of a few minutes, and we report our results below:

      In the main text and in the new Figure 5 – Supplement 3, we report the results of newly performed time-lapse experiments with 20-s resolution over one hour, which we used to investigate the timescale of volume fluctuations. Autocovariance analysis of the detrended curves shows that fluctuations decay over a few minutes (Figure 5 – Supplement 3a-c), a timescale that matches the analysis of the 10-min time-lapse experiments.

      Copy of Figure 5 – Supplement 3: Autocovariance analysis shows that the timescale of volume fluctuation is around 760 seconds. a) Cells measured every 20 sec (n=177) and linearly detrended reach a covariance of 0 at a lag of 760 sec. b) As a control, the background fluctuations are not autocorrelated (20 sec, n=92), providing further evidence that cell volume fluctuations likely have biological origin. c) The autocovariance analysis for cells measured every 10 min confirms that fluctuations covary for a lag of 10-20 min.
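      The analysis in this figure (linear detrending, autocovariance, and the lag at which the autocovariance first reaches zero) can be sketched as follows. The divide-by-n normalization and the zero-crossing rule are assumptions, since the reply does not specify the exact estimator; the function names and the synthetic trajectory are hypothetical.

```python
def detrend_linear(y):
    """Subtract the least-squares straight line from a trajectory."""
    n = len(y)
    x_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    slope = sum((x - x_mean) * (v - y_mean) for x, v in enumerate(y)) / sxx
    return [v - (y_mean + slope * (x - x_mean)) for x, v in enumerate(y)]


def autocovariance(y, max_lag):
    """Biased (divide-by-n) autocovariance for lags 0..max_lag."""
    n = len(y)
    mean = sum(y) / n
    d = [v - mean for v in y]
    return [sum(d[i] * d[i + lag] for i in range(n - lag)) / n for lag in range(max_lag + 1)]


def first_zero_crossing(acov, dt):
    """Lag (in time units) at which the autocovariance first drops to <= 0."""
    for lag, value in enumerate(acov):
        if value <= 0:
            return lag * dt
    return None


# Hypothetical trajectory sampled every 20 s: linear growth plus a short periodic fluctuation.
pattern = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
volumes = [1000.0 + 0.5 * k + pattern[k % 6] for k in range(180)]
acov = autocovariance(detrend_linear(volumes), max_lag=30)
print(first_zero_crossing(acov, dt=20))  # lag in seconds
```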

      3.) The authors use automated smoothing of the measurement and remove outliers based on an IQR criterion. While this seems reasonable if the aim is to get a robust measurement of the average behavior, I find it questionable with respect to the noise measurements. Since no minimum timescale has been associated with the fluctuations interpreted as biological in origin, what is the justification for removing 'outliers', i.e. the feature that the authors are actually interested in? Why would the largest fluctuations be of technical origin, and the smaller fluctuations exclusively biological?

      The IQR criterion is designed to remove only rare and obvious outliers (i.e., a volume jump of more than 15% within one frame, that is, 10 minutes, which arguably cannot happen biologically). Fluctuations of smaller amplitude are kept (see examples below). We looked back at the raw data and found that the IQR filtering removes a total of 337 measurement points out of 99,935 initial points (about 0.34% of the points).

      Figure D: Three examples of single-cell trajectories with raw volume measurements (red dots) and points removed by the IQR filtering (blue dots). The IQR criterion is very stringent and removes only the very large ‘bumps’ in measured cell volume (2 left plots), while it keeps fluctuations of smaller amplitude (right plot).
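      The reply does not spell out the filtering rule beyond the IQR criterion, so the following is only a minimal sketch of one way to apply Tukey's IQR fences to relative deviations from a fitted trend. The linear trend, the fence multiplier k = 1.5, and the function names are assumptions. In the toy trajectory, a single 20% volume spike is flagged while all other points are kept.

```python
import statistics


def linear_fit(y):
    """Least-squares straight line evaluated at each frame."""
    n = len(y)
    x_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    slope = sum((x - x_mean) * (v - y_mean) for x, v in enumerate(y)) / sxx
    return [y_mean + slope * (x - x_mean) for x in range(n)]


def iqr_outliers(volumes, k=1.5):
    """Indices of frames whose relative deviation from the trend lies outside
    the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    fitted = linear_fit(volumes)
    rel = [(v - f) / f for v, f in zip(volumes, fitted)]
    q1, _, q3 = statistics.quantiles(rel, n=4)
    iqr = q3 - q1
    return [i for i, r in enumerate(rel) if r < q1 - k * iqr or r > q3 + k * iqr]


# Toy trajectory: steady growth with a single 20% spike at frame 10.
volumes = [1000.0 + 10 * i for i in range(20)]
volumes[10] *= 1.2
print(iqr_outliers(volumes))  # [10]
```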

      4.) If I understood correctly, each volume trajectory spans one complete cell cycle. If this is the case, does Fig. 1e imply that many cell cycles take less than 2-3 hours? Is this really the case, and if so, what are the implications for some of the interpretations (especially the early cell cycle part)?

      In this study, we performed experiments on a timescale comparable to the cell cycle duration (~24 hours) and recorded single-cell volume trajectories. Since the cells are not synchronized, we captured very few complete cell cycles (~100, Fig. 1f). Fig. 1e shows the duration distribution of all individual curves, regardless of the fraction of the cell cycle they span, which explains the very short durations for some cells.

      Reviewer #2:

      In this paper, the authors use volume exclusion-based measurements to quantify single-cell trajectories of volume increase in HeLa cells. The study represents one of the most careful measurements of volume regulation in animal cells and presents evidence for feedback mechanisms that slow the growth of larger cells. This is an important demonstration of cell-autonomous volume regulation.

      While the subject matter of the present study is important, the insights provided are significantly limited because the authors did not place their findings in the context of previous literature. The authors present what seem to be remarkably accurate single-cell growth trajectories. In animal cells, a joint dependency of growth rate on cell size and cell cycle stage has been previously reported (see eLife 2018 PMID: 29889021 and Science 2009 PMID: 19589995). In Ginzberg et al, it is reported "Our data revealed that, twice during the cell cycle, growth rates are selectively increased in small cells and reduced in large cells". Nonetheless, these previous studies do not negate the novelty of Cadart et al. While both Cadart and Ginzberg investigate a dependency of cellular growth rate on cell size and cell cycle stage, the two studies are complementary. This is because, while Ginzberg characterise the growth in cell mass, Cadart characterise the growth in cell volume. The authors should compare the findings from these previous studies with their own and draw conclusions from the similarities and differences. Are the cell cycle stage-dependent growth rates similar or different when cell size is quantified as mass or volume? Does the faster growth of smaller cells (the negative correlation of growth rate and cell size) occur in different cell cycle stages when growth is quantified by volume as compared to mass?

      We are grateful to the reviewer for their appreciation of the value of our study. Following their remarks, we have extended our Discussion section to discuss these findings more carefully. We believe that the main contributions of our study are finding evidence of phase-dependent regulation of growth rate and identifying an additive noise on volume steps. Because this noise has constant amplitude, fluctuations of the specific growth rate decrease with volume, whereas the specific growth rate itself (in the bulk of the cell cycle) does not decrease.
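      The scaling argument can be checked with a toy model. If each volume step is ΔV = αVΔt + σξ, with additive noise of constant amplitude σ, then the specific growth rate ΔV/(VΔt) has mean α at every volume but a standard deviation of σ/(VΔt), which falls as 1/V. The sketch below assumes Gaussian noise and hypothetical parameter values; it illustrates the stated model, not the authors' analysis pipeline.

```python
import random
import statistics


def specific_rate_stats(volume, alpha=0.03, sigma=5.0, dt=1.0, n=20_000, seed=1):
    """Mean and SD of the specific growth rate dV / (V * dt) when each step is
    dV = alpha * V * dt + sigma * xi, with xi ~ N(0, 1) independent of V."""
    rng = random.Random(seed)
    rates = [(alpha * volume * dt + sigma * rng.gauss(0.0, 1.0)) / (volume * dt)
             for _ in range(n)]
    return statistics.fmean(rates), statistics.pstdev(rates)


for v in (1000.0, 2000.0, 4000.0):
    mean_rate, sd_rate = specific_rate_stats(v)
    # The mean stays near alpha while the SD roughly halves each time V doubles.
    print(f"V={v:6.0f}  mean={mean_rate:.4f}  sd={sd_rate:.5f}")
```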

    1. Author Response:

      Reviewer #1 (Public Review):

      In this manuscript, the authors leverage novel computational tools to detect, classify and extract information underlying sharp-wave ripples, and synchronous events related to memory. They validate the applicability of their method to several datasets and compare it with a filtering method. In summary, they found that their convolutional neural network detection captures more events than the commonly used filter method. This particular capability of capturing additional events which traditional methods don't detect is very powerful and could open important new avenues worth further investigation. The manuscript in general will be very useful for the community as it will increase the attention towards new tools that can be used to solve ongoing questions in hippocampal physiology.

      We thank the reviewer for the constructive comments and appreciation of the work.

      Additional minor points that could improve the interpretation of this work are listed below:

      • Spectral methods could also be used to capture the variability of events if used properly or run several times through a dataset. I think adjusting the statements where the authors compare CNN with traditional filter detections could be useful as it can be misleading to state otherwise.

      We thank the reviewer for this suggestion. We would like to emphasize that we do not at all advocate abandoning filters. We feel that a combination of methods is required to improve our understanding of the complex electrophysiological processes underlying SWR. We have adjusted the text as suggested. In particular, a) we removed the misleading sentence from the abstract and instead stated the need for new automatic detection strategies; b) we edited the introduction similarly and clarified the need for improved online applications.

      • The authors show that their novel method is able to detect "physiological relevant processes" but no further analysis is provided to show that this is indeed the case. I suggest adjusting the statement to "the method is able to detect new processes (or events)".

      We have corrected the text as suggested. In particular, we state that “The new method, in combination with community tagging efforts and optimized filters, could potentially facilitate discovery and interpretation of the complex neurophysiological processes underlying SWR.” (page 12).

      • In Fig.1 the authors show how they tune the parameters that work best for their CNN method and from there they compare it with a filter method. To offer a fairer comparison, analogous tuning of the filter parameters should be tested alongside to show that filters can also be tuned to improve the detection of "ground truth" data.

      Thank you for this comment. As explained before, see below the results of the parameter study for the filter in the very same sessions used for training the CNN. The chosen parameters (100-300 Hz band, order 2) provided maximal performance in the test set. Therefore, both methods are similarly optimized during training. This is now included (page 4): “In order to compare CNN performance against spectral methods, we implemented a Butterworth filter, whose parameters were optimized using the same training set (Fig.1-figure supplement 1D).”

      • Showing a manual score of the performance of their CNN method detection with false positive and false negative flags (and plots) would be clarifying in order to get an idea of the type of events that the method is able to detect and fails to detect.

      We have added information on the categories of False Positives for both the CNN and the filter in the new Fig.4F. We have also prepared an executable figure to show examples and to help readers understand how the CNN works. See the new Fig.5 and the executable notebook https://colab.research.google.com/github/PridaLab/cnn-ripple-executable-figure/blob/main/cnn-ripple-false-positive-examples.ipynb

      • In Fig. 2E the authors show the differences between the CNN with different precision and the filter method; while the performance is better, the trends are extremely similar and the numbers are very close for all comparisons (except for recall, where the filter clearly performs worse than the CNN).

      This refers to the external dataset (Grosmark and Buzsaki 2016), which is now in the new Fig.3E. To address this point and to improve statistical reporting, we have added more data, resulting in 5 sessions from 2 rats. The data confirm better performance of the CNN model versus the filter. The purpose of this figure is to show the effect of the definition of the ground truth on the performance of the different methods, as well as the proper performance of the CNN on external datasets without retraining. Please note that in Grosmark and Buzsaki, SWR detection was conditioned on the coincidence of both population synchrony and LFP definition, thus providing a “partial ground truth” (i.e., SWR without population firing were not annotated in the dataset).
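      The effect of a partial ground truth on precision, recall and F1 can be illustrated with a minimal sketch. It assumes that events are (start, end) intervals in seconds and that a detection counts as a True Positive if it overlaps any ground-truth event; this overlap criterion, the function names and the toy numbers are hypothetical and are not the matching rule or data used in the study. The example shows why annotating previously missed SWR raises precision while leaving recall unchanged.

```python
from typing import List, Tuple

Event = Tuple[float, float]  # (start, end) in seconds; hypothetical representation


def overlaps(a: Event, b: Event) -> bool:
    """True if two closed intervals share at least one time point."""
    return a[0] <= b[1] and b[0] <= a[1]


def detection_scores(detected: List[Event], truth: List[Event]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) using an assumed overlap-based matching rule."""
    # Detections that hit at least one annotated event count as True Positives.
    tp_detections = sum(1 for d in detected if any(overlaps(d, t) for t in truth))
    # Annotated events hit by at least one detection count as recalled.
    recalled = sum(1 for t in truth if any(overlaps(t, d) for d in detected))
    precision = tp_detections / len(detected) if detected else 0.0
    recall = recalled / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data: two annotated SWR, four detections (two of them unannotated).
detected = [(1.02, 1.08), (2.05, 2.12), (3.0, 3.05), (4.0, 4.06)]
partial_truth = [(1.0, 1.1), (2.0, 2.1)]
print(detection_scores(detected, partial_truth))  # precision 0.5, recall 1.0

# Re-annotating one "False Positive" as a genuine SWR raises precision to 0.75.
reannotated_truth = partial_truth + [(3.0, 3.04)]
print(detection_scores(detected, reannotated_truth))
```

In this toy case recall stays at 1.0 in both runs, so the whole change in F1 comes from precision, mirroring the pattern described for the externally annotated sessions.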

      • The authors acknowledge that various forms of SWRs not consistent with their common definition could be captured by their method. But theoretically, it could also be the case that, due to the spectral continuum of the LFP signals, noisy features of the LFP could also be passed as "relevant events"? Discussing this point in the manuscript could help with the context of where the method might be applied in the future.

      As suggested, we have mentioned this point in the revised version. In particular: “While we cannot rule out noisy detection from a continuum of LFP activity, our categorization suggests they may reflect processes underlying the buildup of population events (de la Prida et al., 2006). In addition, the ability of CA3 inputs to bring about gamma oscillations and multi-unit firing associated with sharp-waves is already recognized (Sullivan et al., 2011), and variability of the ripple power can be related with different cortical subnetworks (Abadchi et al., 2020; Ramirez-Villegas et al., 2015). Since the power spectral level operationally defines the detection of SWR, part of this intrinsic microcircuit variability may escape analysis when using spectral filters” (page 16).

      • In Fig. 5 the authors claim that there are striking differences in firing rate and timings of pyramidal cells when comparing events detected in different layers (compared to the SP layer). This is not very clear from the figure, as plots 5G and 5H show that the main differences arise when compared with SO and SLM.

      We apologize for the confusion. We meant that the analysis compared the properties of SWR detected at SO, SR and SLM using values z-scored against SWR detected at SP only. We clarified this point in the revised version: “We found larger sinks and sources for SWR that can be detected at SLM and SR versus those detected at SO (Fig.7G; z-scored by mean values of SWR detected at SP only).” (page 14).
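      The normalization in the quoted sentence can be sketched as follows, under the assumption (not spelled out in the text) that each feature value x of SWR detected at SO, SR or SLM is expressed as z = (x - μ_SP) / σ_SP, where μ_SP and σ_SP are the mean and sample standard deviation of the same feature for SWR detected at SP only. The function name and the toy values are hypothetical.

```python
from statistics import mean, stdev
from typing import List


def zscore_against_reference(values: List[float], reference: List[float]) -> List[float]:
    """Express `values` in standard-deviation units of a reference population.

    Hypothetical helper: `reference` would hold a feature (e.g. sink amplitude)
    for SWR detected at SP only; `values` the same feature at another layer.
    """
    if len(reference) < 2:
        raise ValueError("reference needs at least two values to estimate a standard deviation")
    mu = mean(reference)
    sd = stdev(reference)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("reference has zero variance")
    return [(v - mu) / sd for v in values]


# Toy sink amplitudes in arbitrary units.
sp_only = [1.0, 2.0, 3.0]
slm = [2.0, 4.0]
print(zscore_against_reference(slm, sp_only))  # [0.0, 2.0]
```

Using a common SP-only reference places every layer on the same scale, so a positive z-value reads directly as "larger than typical SP-detected events".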

      • Could the above differences be related to the fact that the performance of the CNN could have different percentages of false-positive when applied to different layers?

      The rate of FP is similar across layers: 0.52 ± 0.21 for SO, 0.50 ± 0.21 for SR and 0.46 ± 0.19 for SLM. This is now mentioned in the text: “No difference in the rate of False Positives between SO (0.52 ± 0.21), SR (0.50 ± 0.21) and SLM (0.46 ± 0.19) can account for this effect.” (page 12)

      Alternatively, could the variability be related to the occurrence (and detection) of similar events in neighboring spectral bands (i.e., gamma events)? Discussion of this point in the manuscript would be helpful for the readers.

      We have discussed this point: “While we cannot rule out noisy detection from a continuum of LFP activity, our categorization suggests they may reflect processes underlying the buildup of population events (de la Prida et al., 2006). In addition, the ability of CA3 inputs to bring about gamma oscillations and multi-unit firing associated with sharp-waves is already recognized (Sullivan et al., 2011), and variability of the ripple power can be related with different cortical subnetworks (Abadchi et al., 2020; Ramirez-Villegas et al., 2015).” (page 16)

      Overall, I think the method is interesting and could be very useful to detect more nuance within hippocampal LFPs and offer new insights into the underlying mechanisms of hippocampal firing and how they organize in various forms of network events related to memory.

      We thank the reviewer for constructive comments and appreciation of the value of our work.

      Reviewer #2 (Public Review):

      Navas-Olive et al. provide a new computational approach that implements convolutional neural networks (CNNs) for detecting and characterizing hippocampal sharp-wave ripples (SWRs). SWRs have been identified as important neural signatures of memory consolidation and retrieval, and there is therefore interest in developing new computational approaches to identify and characterize them. The authors demonstrate that their network model is able to learn to identify SWRs by showing that, following the network training phase, performance on test data is good. Performance of the network varied by the human expert whose tagging was used to train it, but when experts' tags were combined, performance of the network improved, showing it benefits from multiple inputs. When the network trained on one dataset is applied to data from different experimental conditions, performance was substantially lower, though the authors suggest that this reflected erroneous annotation of the data, and once corrected performance improved. The authors go on to analyze the LFP patterns that nodes in the network develop preferences for and compare the network's performance on SWRs and non-SWRs, both providing insight and validation about the network's function. Finally, the authors apply the model to dense Neuropixels data and confirm that SWR detection was best in the CA1 cell layer but could also be detected at more distant locations.

      The key strengths of the manuscript lie in a convincing demonstration that a computational model that does not explicitly look for oscillations in specific frequency bands can nevertheless learn to detect them from tagged examples. This provides insight into the capabilities and applications of convolutional neural networks. The manuscript is generally clearly written and the analyses appear to have been carefully done.

      We thank the reviewer for the summary and for highlighting the strengths of our work.

      While the work is informative about the capabilities of CNNs, the potential of its application for neuroscience research is considerably less convincing. As the authors state in the introduction, there are two potential key benefits that their model could provide (for neuroscience research): 1. improved detection of SWRs and 2. providing additional insight into the nature of SWRs, relative to existing approaches. To this end, the authors compare the performance of the CNN to that of a Butterworth filter. However, there are a number of major issues that limit the support for the authors' claims:

      Please see our answers to the specific questions below, which we hope clarify the validity of our approach.

      • Putting aside the question of whether the comparison between the CNN and the filter is fair (see below), it is unclear if even as is, the performance of the CNN is better than a simple filter. The authors argue for this based on the data in Fig. 1F-I. However, the main result appears to be that the CNN is less sensitive to changes in the threshold, not that it does better at reasonable thresholds.

      This comment now refers to the new Fig.2A (offline detection) and Fig.2C,D (online detection). Starting with offline detection: yes, the CNN is less sensitive to the threshold than the filter, and that has major consequences both offline and online. For the filter to reach its best performance, the threshold has to be tuned, which is a time-consuming process. Importantly, this is only possible when the ground truth is known. In practical terms, most labs run a semi-automatic detection approach in which events are first detected and then manually validated. Because the filter is more sensitive to the threshold, this process becomes very tedious. The CNN, in contrast, is more stable.

      In trying to be fair, we also compared the CNN and the filter at their best performance (i.e., using the threshold that provides the best match with the ground truth). This is shown in Fig.3A. There are no differences between methods, indicating that the CNN meets the gold standard provided the filter is optimized. Note again that this is only possible if the ground truth is known, because optimization is based on finding the best threshold per session.

      Importantly, both methods reach their best performance at the expert’s limit (gray band in Fig.3A,B). They cannot be better than the individual ground truth. This is why we advocate for community tagging collaborations to consolidate sharp-wave ripple definitions.

      Moreover, the mean performance of the filter across thresholds appears dramatically dampened by its performance on particularly poor thresholds (Fig. 1F, I, weak traces). How realistic these poorly tested thresholds are is unclear. The single direct statistical test of difference in performance is presented in Fig. 1H, but it is unclear if there is a real difference there, as graphically it appears that animals and sessions from those animals were treated as independent samples (and comparing only animal averages or only sessions clearly does not show a significant difference).

      Please note that this refers to online detection. We are not sure we understand the comment on whether the thresholds are realistic. To clarify, we detect SWR online using thresholds that we optimize in the same way for the filter and the CNN over the course of the experiment. This is reported in Fig.2C both per session and per animal, reaching statistical significance (we added more experiments to increase statistical power). Since thresholds defined online may still not be optimal, we then annotated these data and ran an additional post hoc offline optimization analysis, which is presented in Fig.2D. We hope this is now clearer in the revised version.

      Finally, the authors show in Fig. 2A that for the best threshold the CNN does not do better than the filter. Together, these results suggest that the CNN does not generally outperform the filter in detecting SWRs, but only that it is less sensitive to usage of extreme thresholds.

      We hope this is now clarified. See our response to your first bullet point.

      Indeed, I am not convinced that a non-spectral method could even theoretically do better than a spectral method to detect events that are defined by their spectrum, assuming all other aspects are optimized (such as combining data from different channels and threshold setting).

      As can be seen in the responses to the editor's synthesis, we have optimized the filter parameters in the same way (new Fig.1-supp-1D), and using more channels brings no improvement (see below). In any case, we would like to emphasize that we do not at all advocate abandoning filters. We feel that a combination of methods is required to improve our understanding of the complex electrophysiological processes underlying SWR.

      • The CNN network is trained on data from 8 channels but it appears that the compared filter is run on a single channel only. This is explicitly stated for the online SWR detection and presumably, that is the case for the offline as well. This unfair comparison raises the possibility that whatever improved performance the CNN may have may be due to considerably richer input and not due to the CNN model itself. The authors state that a filter on the data from a single channel is the standard, but many studies use various "consensus" heuristics, e.g. in which elevated ripple power is required to be detected on multiple channels simultaneously, which considerably improves detection reliability. Even if this weren't the case, because the CNN learns how to weight each channel, to argue that better performance is due to the nature of the CNN it must be compared to an algorithm that similarly learns to optimize these weights on filtered data across the same number of channels. It is very likely that if this were done, the filter approach would outperform the CNN as its performance with a single channel is comparable.

      We appreciate this comment. Using one channel to detect SWR is very common for offline detection followed by manual curation. In some cases, a second channel is used either to veto spurious detections (using a non-ripple channel) or to confirm detection (using a second ripple channel and/or a sharp-wave) (Fernandez-Ruiz et al., 2019). Many others combine detection of population firing with the filter to identify replay (such as in Grosmark and Buzsaki 2019, where ripples were conditioned on the coincidence of both population firing and LFP-detected ripples). To address this comment, we compared performance using different combinations of channels, from the standard detection at the SP layer (pyr) up to 4 and 8 channels around SP using consensus heuristics. Filter performance is consistent across configurations, and using 8 channels does not improve detection. We clarify this in the revised version: “We found no effect of the number of channels used for the filter (1, 4 and 8 channels), and chose the one with the highest ripple power” (see caption of Fig.1-supp-1D).
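      The multi-channel "consensus" heuristic discussed in this exchange can be sketched as follows. The sketch assumes each channel has already been band-pass filtered and reduced to a non-negative ripple-power envelope (that step is omitted), marks samples whose envelope exceeds the channel mean plus `n_sd` standard deviations, and keeps only samples where at least `min_channels` channels cross threshold simultaneously. The function names, the SD-based threshold and the toy data are illustrative assumptions, not the detection code used in the study.

```python
from statistics import mean, pstdev
from typing import List, Optional, Tuple


def threshold_mask(envelope: List[float], n_sd: float) -> List[bool]:
    """True where the envelope exceeds its mean plus n_sd population standard deviations."""
    cutoff = mean(envelope) + n_sd * pstdev(envelope)
    return [x > cutoff for x in envelope]


def consensus_mask(channel_masks: List[List[bool]], min_channels: int) -> List[bool]:
    """True at samples where at least `min_channels` channels are above threshold."""
    n = len(channel_masks[0])
    if any(len(m) != n for m in channel_masks):
        raise ValueError("all channels must have the same number of samples")
    return [sum(m[i] for m in channel_masks) >= min_channels for i in range(n)]


def mask_to_events(mask: List[bool]) -> List[Tuple[int, int]]:
    """Convert a boolean mask into (start, stop) sample indices, with stop exclusive."""
    events: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, on in enumerate(mask):
        if on and start is None:
            start = i  # an above-threshold run begins
        elif not on and start is not None:
            events.append((start, i))  # the run ended at the previous sample
            start = None
    if start is not None:
        events.append((start, len(mask)))  # run extends to the end of the recording
    return events


# Toy per-channel masks (3 channels x 6 samples), require agreement on 2 channels.
masks = [
    [False, True, True, False, False, True],
    [False, True, False, False, False, True],
    [False, False, True, False, True, False],
]
print(mask_to_events(consensus_mask(masks, min_channels=2)))  # [(1, 3), (5, 6)]
```

Raising `min_channels` trades recall for precision, and with `min_channels=1` the rule reduces to accepting a crossing on any single channel.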

      • Related to the point above, for the proposed CNN model to be a useful tool in the neuroscience field it needs to be amenable to the kind of data and computational resources that are common in the field. As the network requires 8 channels situated in close proximity, the network would not be relevant for numerous studies that use fewer or spaced channels. Further, the filter approach does not require training and it is unclear how generalizable the current CNN model is without additional network training (see below). Together, these points raise the concern that even if the CNN performance is better than a filter approach, it would not be usable by a wide audience.

      Thank you for this comment. To handle different input channel configurations, we have developed an interpolation approach that transforms any data into 8-channel inputs. We are currently applying the CNN without retraining to data from several labs using different electrode numbers and configurations, including tetrodes, linear silicon probes and wires. Results confirm the performance of the CNN. Since we cannot disclose these third-party data here, we have used a new dataset from our own lab to illustrate the case. See below the results from 16-channel silicon probes (100 µm inter-electrode separation), where the CNN performed better than the filter (F1: p=0.0169; Precision: p=0.0110; 7 sessions from 3 mice). We found that the performance of the CNN depends on the laminar LFP profile, as the Neuropixels data illustrate.
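      The interpolation approach itself is not detailed in this response. Purely as an illustration of the general idea, the sketch below assumes that channels are ordered by strictly increasing depth and linearly interpolates each time sample across depth onto `n_out` evenly spaced virtual channels spanning the recorded range. The function names and the choice of linear interpolation are assumptions; this is not the method used by the authors.

```python
from typing import List


def interp1(x: List[float], xp: List[float], fp: List[float]) -> List[float]:
    """Piecewise-linear interpolation with end clamping (like numpy.interp); xp strictly increasing."""
    out = []
    for xi in x:
        if xi <= xp[0]:
            out.append(fp[0])
            continue
        if xi >= xp[-1]:
            out.append(fp[-1])
            continue
        j = 1
        while xp[j] < xi:  # find the first recorded depth at or below xi
            j += 1
        t = (xi - xp[j - 1]) / (xp[j] - xp[j - 1])
        out.append(fp[j - 1] + t * (fp[j] - fp[j - 1]))
    return out


def resample_to_n_channels(lfp: List[List[float]], depths: List[float], n_out: int = 8) -> List[List[float]]:
    """Map lfp[channel][sample] onto n_out virtual channels evenly spaced in depth (hypothetical)."""
    if len(lfp) != len(depths) or len(depths) < 2:
        raise ValueError("need one depth per channel and at least two channels")
    if any(b <= a for a, b in zip(depths, depths[1:])):
        raise ValueError("depths must be strictly increasing")
    if n_out < 2:
        raise ValueError("n_out must be at least 2")
    lo, hi = depths[0], depths[-1]
    new_depths = [lo + k * (hi - lo) / (n_out - 1) for k in range(n_out)]
    n_samples = len(lfp[0])
    out = [[0.0] * n_samples for _ in range(n_out)]
    for s in range(n_samples):
        column = [ch[s] for ch in lfp]  # depth profile at this time sample
        for k, v in enumerate(interp1(new_depths, depths, column)):
            out[k][s] = v
    return out


# Two recorded channels 700 um apart, two time samples, mapped onto 8 virtual channels.
virtual = resample_to_n_channels([[0.0, 1.0], [7.0, 8.0]], [0.0, 700.0])
print(len(virtual))  # 8
```

Linear interpolation is the simplest choice; it cannot recreate laminar detail that was never sampled, which is consistent with the observation that performance depends on the laminar LFP profile.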

      • A key point is whether the CNN generalizes well across new datasets as the authors suggest. When the model trained on mouse data was applied to rat data from Grosmark and Buzsaki, 2016, precision was low. The authors state that "Hence, we evaluated all False Positive predictions and found that many of them were actually unannotated SWR (839 events), meaning that precision was actually higher". How were these events judged as SWRs? Was the test data reannotated?

      We apologize for not explaining this better in the original version. We chose Grosmark and Buzsaki 2016 because it provides an “incomplete ground truth”, since (citing their Methods) “Ripple events were conditioned on the coincidence of both population synchrony events, and LFP detected ripples”. This means there are LFP ripples not included in their GT. This dataset provides a very good example of how the experimental goal (examining replay and thus relying on population firing plus LFP definitions) may limit the ground truth.

      Please, note we use the external dataset for validation purposes only. The CNN model was applied without retraining, so it also helps to exemplify generalization. Consistent with a partial ground truth, the CNN and the filter recalled most of the annotated events, but precision was low. By manually validating False Positive detections, we re-annotated the external dataset and both the CNN and the filter increased precision.

      To make the case clearer, we now include more sessions to increase the data size and test for statistical effects (Fig.3E). We also changed the example to show more cases of re-annotated events (Fig.3D). We have clarified the text: “In that work, SWR detection was conditioned on the coincidence of both population synchrony and LFP definition, thus providing a “partial ground truth” (i.e. SWR without population firing were not annotated in the dataset).” (see page 7).

      • The argument that the network improves with data from multiple experts while the filter does not requires further support. While Fig. 1B shows that the CNN improves performance when the experts' data is combined and the filter doesn't, the final performance on the consolidated data does not appear better in the CNN. This suggests that performance of the CNN when trained on data from single experts was lower to start with.

      This comment refers to the new Fig.3B. We apologize for not having included a between-method comparison in the original version. To address this, we now include a one-way ANOVA for the effect of the type of ground truth on each method, and an independent one-way ANOVA for the effect of the method on the consolidated ground truth. To increase statistical power, we have added more data. We also detected a mistake involving duplicated data in the original figure, which has been corrected. Importantly, the rationale behind the experts' consolidated data is that there is about 70% consistency between experts, so many SWR remain unannotated in the individual ground truths. These are typically ambiguous events that may generate discussion between experts, such as sharp-waves with population firing and few ripple cycles. Since the CNN is better at detecting them, this explains why its performance improves when data from multiple experts are integrated.

      Further, regardless of the point in the bullet point above, the data in Fig. 1E does not convincingly show that the CNN improves while the filter doesn't as there are only 3 data points per comparison and no effect on F1.

      Fig.1E shows an example, so we assume the reviewer refers to the new Fig.2C, which shows data on online operation, where we originally reported the analysis per session and per animal separately with only 3 mice. We have run more experiments to increase the data size and test for statistical effects (8 sessions, 5 mice; per session p=0.0047; per mouse p=0.033; t-test). This is now corrected in the text and in the Fig.1C caption. Please note that a post hoc offline evaluation of these online sessions confirmed better performance of the CNN versus the filter for all normalized thresholds (Fig.2D).

      • Apart from the points above regarding the ability of the network to detect SWRs, the insight into the nature of SWRs that the authors suggest can be achieved with CNNs is limited. For example, the data in Fig. 3 is a nice analysis of what the components of the CNN learn to identify, but the claim that "some predictions not consistent with the current definition of SWR may identify different forms of population firing and oscillatory activities associated to sharp-waves" is not thoroughly supported. The data in Fig. 4 is convincing in showing that the network better identifies SWRs than non-SWRs, but again the insight is about the network rather than about SWRs.

      In the revised version, we now include validation of all False Positives detected by the CNN and the filter (Fig.4F). To help the reader examine examples of True Positive and False Positive detections, we also include a new figure (Fig.5), which comes with executable code (see page 9). We also include comparisons of the features of TP events detected by both methods (Fig.2B), which show that SWR events detected by the CNN exhibited features more similar to those of the ground truth (GT) than those detected by the filter. We feel the entire manuscript provides support for these claims.

      Finally, the application of the model on Neuropixels data also nicely demonstrates the applicability of the model on this kind of data but does not provide new insight regarding SWRs.

      We respectfully disagree. Please note that the application to ultra-dense Neuropixels data not only applies the model to an entirely new dataset without retraining, but also shows that some SWR with larger sinks and sources can actually be detected at input layers (SO, SR and SLM). Importantly, those events result in different firing dynamics, providing mechanistic support for heterogeneous behavior underlying, for instance, replay.

      In summary, the authors have constructed an elegant new computational tool and convincingly shown its validity in detecting SWRs and applicability to different kinds of data. Unfortunately, I am not convinced that the model convincingly achieves either of its stated goals: exceeding the performance of SWR detection or providing new insights about SWRs as compared to considerably simpler and more accessible current methods.

      We thank you again for your constructive comments. We hope you are now convinced of the value of the new method in light of the newly added data.

    1. Author Response:

      Reviewer #1:

      The authors found a switch between "retrospective", sensory recruitment-like representations in visual regions when a motor response could not be planned in advance, and "prospective" action-like representations in motor regions when a specific button response could be anticipated. The use of classifiers trained on multiple tasks - an independent spatial working memory task, spatial localizer, and a button-pressing task - to decode working memory representations makes this a strong study with straightforward interpretations well-supported by the data. These analyses provide a convincing demonstration that not only are different regions involved when a retrospective code is required (or alternatively when a prospective code can be used), but the retrospective representations resemble those evoked by perceptual input, and the prospective representations resemble those evoked by actual button presses.

      I have just a couple of points that could be elaborated on:

      1. While there is a clear transition from representations in visual cortex to representations in sensorimotor regions when a button press can be planned in advance, the visual cortex representations do not disappear completely (Figs 2B and C). Is the most plausible interpretation that participants just did not follow the cue 100% of the time, or that some degree of sensory recruitment is happening in visual cortex obligatorily (despite being unnecessary for the task) and leading to a more distributed, and potentially more robust code?

      This is a very good point, and indeed could be considered surprising. While previous work suggests that sensory recruitment is not obligatory when an item can be dropped from memory entirely (e.g., Harrison & Tong, 2009; Lewis-Peacock et al., 2012; Sprague et al., 2014; Sprague et al., 2016; Lorenc et al., 2020), other work suggests that an item which might still be relevant later in a trial (i.e., a so-called “unattended memory item”) can still be decoded during the delay (see the re-analyses in Iamshchinina et al., 2021 from the original Christophel et al. 2018 paper). In short, we cannot exclude that in our paradigm there is some low-grade sensory recruitment happening in visual cortex, even when an action-oriented code can theoretically be used. This would be consistent with a more distributed code, which could potentially increase the overall robustness of working memory.

      At the same time, as the reviewer points out, there is a possibility that on some fraction of trials, participants failed to perfectly encode the cue, or forgot the cue, which might mean they were using a sensory-like code even on some trials in the informative cue condition. This is a reasonable possibility given that we used a trial-by-trial interleaved design, where participants needed to pay close attention on each trial in order to know the current condition. Since we averaged decoding performance across all trials, the above-chance decoding accuracy could be driven by a small fraction of trials during which spatial strategies were used despite the informative nature of the preview disk.

      Finally, another factor is the averaging of data across multiple TRs from the delay period. In Figure 2B, the decoding was performed using data that was averaged over several TRs around the middle of the delay period (8-12.8 seconds from trial start). This interval is early enough that the process of re-coding a representation from sensory to motor cortex may not be complete yet, so this might be an explanation for the relatively high decoding accuracy seen in the informative condition in Figure 2B. Indeed, the time-resolved analyses (Figure 2C, Figure 2 – figure supplement 1) show that the decoding accuracy for the informative condition continues to decline later in the delay period, though it does not go entirely to chance (with the possible exception of area V1).

      Of course, our ability to decode spatial position despite participants having the option to use a pure action-oriented code may be due to a combination of all of the above: some amount of low-grade obligatory sensory recruitment, as well as occasional trials with higher-precision spatial memory due to a missed cue. We have added a paragraph to the discussion to now acknowledge these possibilities.

      Finally, although it is conceptually important to consider the reasons why decoding in the informative condition did not drop entirely to chance, we also note that whether decoding goes to chance in one condition is not critical to the main findings of our paper. The data show a robust difference in spatial decoding accuracy in visual cortex between the two conditions, which indicates that the relative amount of information in visual cortex was modulated by the task condition, regardless of the absolute information content in each condition.

      2. To what extent might the prospective code reflect an actual finger movement (even just increased pressure on the button to be pressed) in advance of the button press? For instance, it could be the case that the participant with extremely high button press-trained decoding performance in 4B, especially, was using such a strategy. I know that participants were instructed not to make overt button presses in advance, but I think it would be helpful to elaborate a bit on the evidence that these action-related representations are truly "working memory" representations.

      This is a good point, and we acknowledge the possibility of some preparatory motor activity during the delay period on trials in the informative condition. However, we still interpret the delay-period representations during the informative condition as a signature of working memory, for several reasons.

      First, the participants were explicitly instructed to withhold overt finger movements until the final probe disk was shown. We monitored participants closely during their task training phase, which took place outside the scanner, for early button presses, and ensured that they understood and followed the directive to withhold a button press until the correct time. We also confirmed that participants were not engaging in any noticeable motor rehearsal behaviors, such as tapping their fingers just above the buttons. During the scans, we also monitored participants using a video feed that was positioned in a way that allowed us to see their hands on the response box and confirmed that participants were not making any overt finger movements during the delay period. Additionally, most of our participants were relatively experienced, having participated in at least one other fMRI study with our group in the past, and therefore we expect them to have followed the task instructions accurately.

      The distribution of response times for trials in the informative condition also provides some evidence against the idea that participants were already making a button press ahead of the response window. The earliest presses occurred around 250 ms (see below figure, left panel). This response time is consistent with the typical range of human choice response times observed experimentally (e.g. Luce, 1991), suggesting that participants did not execute a physical response in advance of the probe disk appearance, but waited until the response disk stimulus appeared to begin motor response execution.

      Finally, even if we assume that some amount of low-grade motor preparatory activity was occurring, this is still broadly consistent with the way that working memory has been defined in past literature. Past work has distinguished between retrospective and prospective working memory, with retrospective memory being similar in format to previously encountered sensory stimuli, and prospective memory being more closely aligned with upcoming events or actions (Funahashi, Chafee, & Goldman-Rakic, 1993; Rainer, Rao & D’Esposito, 1999; Curtis, Rao, & D’Esposito, 2004; Rahmati et al., 2018; Nobre & Stokes, 2019). Indeed, the transformation of a memory representation from a retrospective code to prospective memory code is often associated with increased engagement of circuits directly related to motor control (Schneider, Barth, & Wascher, 2017; Myers, Stokes, & Nobre, 2017). According to this framework, covert motor preparation could be considered a representation at the extreme end of the prospective memory continuum. Also consistent with this idea, past work has demonstrated that the selection and manipulation of items in working memory can be accompanied by systematic eye movements biased to the locations at which memoranda were previously presented (Spivey & Geng, 2001; Ferreira et al., 2008; van Ede et al., 2019b; van Ede et al. 2020). These physical eye movements may indeed play a functional role in the retrieval of items from memory (Ferreira et al., 2008; van Ede et al., 2019b). These findings suggest that working memory is tightly linked with both the planning and execution of motor actions, and that the mnemonic representations in our task, even if they include some degree of covert motor preparatory activity, are within the realm of representations that can be defined as working memory.

      We have now included a discussion of this issue in the text of our manuscript.

      Reviewer #2:

      Henderson, Rademaker and Serences use fMRI to arbitrate between theories of visual working memory proposing fixed versus flexible loci for maintaining information. By comparing activation patterns in tasks with predictable versus unpredictable motor responses, they find different extents of information retrieval in sensory- versus motor-related areas, thus arguing that the amount and format of retrospective sensory-related versus prospective motor-related information maintained depends on what is strategically beneficial for task performance.

      I share the authors' sense of the importance of this fundamental question and their enthusiasm for the conclusions, and I applaud the advanced methodology. I did, however, struggle with some aspects of the experimental design and (therefore) the logic of interpretation. I hope these are easily addressable.

      Conceptual points:

      1. The main informative versus uninformative conditions differ in more than just the knowledge about the response. In the informative case, participants could select both the relevant sensory information (light or dark shade) and the corresponding response. In essence, their task was done, and they just needed to wait for a later go signal, the second disk. (The activity in the delay could be considered to be one of purely motor preparation or of holding a decision/response.) In the uninformative condition, sensory information at the spatial location was not relevant, nor could the response be predicted. Participants instead had to hold on to the spatial location to apply it to the second disk. These conditions are more different than the authors propose, and therefore it is not straightforward to interpret the findings in the framework set up by the authors. A clear demonstration for the question posed would require participants to hold the same working-memory content for different purposes, but here the content that needs to be held differs vastly between conditions. The authors may argue this is, nevertheless, the essence of their point, but this is a weak strawman to combat.

      It is true that the conditions in our task differ in several respects, including the content of the representation that must be stored. The uninformative condition trials required the participant to maintain a high-precision, sensory-like spatial representation of the target stimulus, without the ability to plan a motor response or re-code the representation into a coarser format. In contrast, the informative condition trials allowed the participant to re-code their representation into a more action-oriented format than the representation needed for the uninformative condition trials, and the code is also binary (right or left) rather than continuous.

      However, we do not think these differences present an issue for the interpretation of our study. The primary goal of our study was to demonstrate that the brain regions and representational formats utilized for working memory storage may differ depending on parameters of the task, rather than having fixed loci or a single underlying neural mechanism. To achieve this, we intentionally created conditions that are meant to sit at fairly extreme ends of the continuum of working memory task paradigms employed in past work. Our uninformative condition is similar to past studies of spatial working memory with human participants that encourage high-precision, sensory-like codes (e.g., Bays & Husain, 2008; Sprague et al., 2014; Sprague et al., 2016; Rahmati et al., 2018) and our informative condition is more similar to classic delayed-saccade task studies in non-human primates, which often allowed explicit motor planning (Funahashi et al., 1989; Goldman-Rakic, 1995). By having the same participants perform these distinct task conditions on interleaved trials, we can better understand the relationship between these task paradigms and how they influence the mechanisms of working memory.

      Importantly, it is not trivial or guaranteed that we should have found a difference in neural representations across our task conditions. In particular, an alternative perspective presented in past work is that the memory representations detected in early visual cortex in various tasks are actually not essential to mnemonic storage (Leavitt, Mendoza-Halliday, & Martinez-Trujillo, 2017; Xu, 2020). On this view, if visual cortex representations are not functionally relevant for the task, one might have predicted that our spatial decoding accuracy in early visual areas would have been similar across conditions, with visual cortex engaged in an obligatory manner regardless of the exact format of the representation required. Instead, we found a dramatic difference in decoding accuracy across our task conditions. This finding underscores the functional importance of early visual cortex in working memory maintenance, because its engagement appears to be dependent on the format of the representation required for the current task.

      Relatedly, some past work has also suggested that in the context of an oculomotor delayed response task, the maintenance of action-oriented motor codes can be associated with topographically specific patterns of activation in early visual cortex which resemble those recorded during sensory-like spatial working memory maintenance (Saber et al., 2015; Rahmati et al., 2018). This is true for both prosaccade trials, in which saccade goals are linked to past sensory inputs, and anti-saccade trials, in which motor plans are dissociated from past sensory inputs. These findings indicate that even for task conditions which on the surface would appear to require very different cognitive strategies, there can, at least in some contexts, be a substantial degree of overlap between the neural mechanisms supporting sensory-like and action-oriented working memory. This again highlights the novelty of our findings, in which we demonstrate a robust dissociation between the brain areas and neural coding format that support working memory maintenance for different task conditions, rather than overlapping mechanisms for all types of working memory.

      Additionally, there are important respects in which the task conditions have similarities, rather than being entirely different. As pointed out by Reviewer #1, the decoding of spatial information in early visual cortex regions did not drop entirely to chance in the informative condition, even by the end of the delay period (Figure 2C, Figure 2 – figure supplement 1). As discussed above in our reply to R1, this finding may suggest that the neural code in the informative condition continues to rely on visual cortex activation to some extent, even when an action-oriented coding strategy is available. This possibility of a partially distributed code suggests that while the two conditions in our task appear different in terms of the optimal strategy associated with each one, in practice the neural mechanisms supporting the tasks may be somewhat overlapping (although the different mechanisms are differentially recruited based on task demands, which is our main point).

      Another aspect of our results which suggests a degree of similarity between the task conditions is that the univariate delay period activation in early visual cortex (V1-hV4) was not significantly different between conditions (Figure 1 – figure supplement 1). Thus, it is not simply the case that the participants switched from relying purely on visual cortex to purely on motor cortex – the change in information content instead reflects a much more strategically graded change to the pattern of neural activation. This point is elaborated further in the response to point (2) below.

      2. Given the nature of the manipulation and the fact that the type of upcoming trial (informative versus uninformative) was cued, how can effects of anticipated difficulty, arousal, or other nuisance variables be discounted? Although pattern-based analyses suggest the effects are not purely related to general effects (the authors argue this in the Discussion, page 14), general variables can interact with specific aspects of information processing, leading to modulation of specific effects.

      There are several aspects of our results which suggest that they are not due to effects such as anticipated difficulty or general arousal. First, we designed our experiment using a randomly interleaved trial order, such that participants could not anticipate the experimental condition on a trial-by-trial basis. Participants only learned which condition each trial was in when the condition cue (color change at fixation; Figure 1A) appeared, which happened 1.5 seconds into the delay period. Thus, any potential effects of anticipated difficulty could not have influenced the initial encoding of the target stimulus, and would have had to take effect later in the trial. Second, as the reviewer pointed out, we did not observe any statistically significant modulation of the univariate delay period BOLD signal in early visual ROIs V1-hV4 between task conditions (Figure 1D, Figure 1 – figure supplement 1), which argues against the idea that there is a global modulation of early visual cortex induced by arousal or changes in difficulty.

      Additionally, our results demonstrate a dissociation between univariate delay period activation in IPS and sensorimotor cortex ROIs as a function of task condition (Figure 1D, Figure 1 – figure supplement 1). In each IPS subregion (IPS0-IPS3), the average BOLD signal was significantly greater during the uninformative versus the informative condition at several timepoints in the delay period, while in S1, M1, and PMc, average signal was significantly greater for the informative than the uninformative condition at several timepoints. If a global change in mean arousal or anticipated difficulty were a main driving factor in our results, then we would have expected to see an increase in the univariate response throughout the brain for the more difficult task condition (i.e., the uninformative condition). Instead, we observed effects of task condition on univariate BOLD signal that were specific to particular ROIs. This indicates that modulations of neural activation in our task reflect a more fine-grained change in neural processing, rather than a global change in arousal or anticipated difficulty.

      Furthermore, to determine whether the changes in decoding accuracy in early visual cortex were specific to the memory representation or reflected a more general change in signal-to-noise ratio, we performed a new analysis assessing whether the processing of incoming sensory information differed between our two conditions. As mentioned above, initial sensory processing of the memory target stimulus was equated across conditions, since participants did not know the task condition until the cue was presented 1.5 s into the trial. However, because the “preview disk” was presented after the cue, it is possible that the preview disk stimulus was processed differently as a function of task condition. Evidence for differential processing of the preview disk would suggest that non-mnemonic factors, such as arousal, contribute to the observed differences in decoding accuracy, because such factors should affect the processing of all stimuli. Conversely, a lack of evidence for differential processing of the preview disk would be consistent with a mnemonic source of the differences between task conditions.

      As shown in the new figure below (now Figure 2 – figure supplement 3), we used a linear decoder to measure the representation of the “preview disk” stimulus that was shown to participants early in the delay period, just after the condition cue (Figure 1A). This disk has a light and dark half separated by a linear boundary whose orientation can span a range of 0°-180°. To measure the representation of the disk’s orientation, we binned the data into four bins centered at 0°, 45°, 90°, and 135°, and trained two binary decoders to discriminate the bins that were 90° apart (an adapted version of the approach shown in Figure 2A; similar to Rademaker et al., 2019). Importantly, the orientation of this disk was random with respect to the memorized spatial location, allowing us to run this analysis independently from the spatial-position decoding in the main manuscript text.
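The binning and pairing scheme described above can be sketched in a few lines of Python. This is a minimal, hypothetical illustration, not the analysis code used in the study: the helper names are invented, trials are cross-validated by leaving out one trial at a time, and a nearest-centroid classifier stands in for the linear decoder, whose exact form is not specified here (with two classes and Euclidean distance its decision boundary is a hyperplane, so it is one simple linear decoder).

```python
import math
from statistics import fmean

# Hypothetical sketch: four 45-degree-wide bins centred at 0, 45, 90 and 135 degrees.
# Two binary problems, each pairing bins 90 degrees apart: 0 vs 90 and 45 vs 135.
DECODER_PAIRS = ((0, 2), (1, 3))


def orientation_bin(theta_deg):
    """Index (0-3) of the bin whose centre is closest to an orientation.

    Orientation is circular with a 180-degree period, so 170 degrees lands in the 0-degree bin.
    """
    return int(((theta_deg + 22.5) % 180) // 45)


def nearest_centroid_loo(patterns, labels):
    """Leave-one-out accuracy of a two-class nearest-centroid classifier.

    Each class needs at least two trials so that a centroid survives the held-out trial.
    """
    correct = 0
    for i, (x, y) in enumerate(zip(patterns, labels)):
        centroids = {}
        for lab in set(labels):
            train = [p for j, (p, lab_j) in enumerate(zip(patterns, labels))
                     if j != i and lab_j == lab]
            # Mean pattern of the remaining trials of this class, voxel by voxel.
            centroids[lab] = [fmean(voxel) for voxel in zip(*train)]
        predicted = min(centroids, key=lambda c: math.dist(x, centroids[c]))
        correct += predicted == y
    return correct / len(labels)


def orientation_decoding_accuracy(patterns, orientations_deg):
    """Average leave-one-out accuracy of the two binary decoders."""
    bins = [orientation_bin(t) for t in orientations_deg]
    accuracies = []
    for a, b in DECODER_PAIRS:
        keep = [k for k, bin_k in enumerate(bins) if bin_k in (a, b)]
        accuracies.append(
            nearest_centroid_loo([patterns[k] for k in keep], [bins[k] for k in keep])
        )
    return fmean(accuracies)
```

Training each decoder only on bins 90° apart gives two balanced binary problems whose accuracies can be averaged into a single orientation-decoding score per region.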

      We found that in both conditions, the orientation of the preview disk boundary could be decoded from early visual cortex (all p-values<0.001 for V1-hV4 in both conditions; evaluated using nonparametric statistics as described in Methods), with no significant difference between our two task conditions (all p-values>0.05 for condition difference in V1-hV4). This indicates that in both task conditions, the incoming sensory stimulus (“preview disk”) was represented with similar fidelity in early visual cortex. At the same time, and in the same regions, the representation of the remembered spatial stimulus was significantly stronger in the uninformative condition than the informative condition. Therefore, the difference between task conditions appears to be specific to the quality of the spatial memory representation itself, rather than a change in the overall signal-to-noise ratio of representations in early visual cortex. This suggests that the difference between task conditions in early visual cortex reflects a difference in the brain networks that support memory maintenance in the two conditions, rather than extra processing of the preview disk in one condition over the other, a more general effect of arousal, or anticipated difficulty.

      This result is also relevant to the concerns raised by the reviewer in point (1) regarding the possibility that the selection of relevant sensory information (i.e., the light/dark side of the disk) was different between the two task conditions. Since the decoding accuracy for the preview disk orientation did not differ between task conditions, this argues against the idea that differential processing of the preview disk may have contributed to the difference in memory decoding accuracy that we observed.

      3. I see what the authors mean by retrospective and prospective codes, but in a way all the codes are prospective. Even the sensory codes, when emphasized, are there to guide future discriminations or to add sensory granularity to responses, etc. Perhaps casting this in terms of sensory/perceptual versus motor/action codes may be less problematic.

      This is a good point, and we agree that in some sense all the memory codes could be considered prospective because in both conditions, the participant has some knowledge of the way that their memory will be probed in the future, even when they do not know their exact response yet. We have changed our language in the text to reflect the suggested terms “perceptual” and “action”, which will hopefully also make the difference between the conditions clearer to the reader.

      4. In interpreting the elevated univariate activation in the parietal IPS0-3 areas, the authors state "This pattern is consistent with the use of a retrospective spatial code in the uninformative condition and a prospective motor code in the informative condition" (page 6). Given points 1 and 3 above, one could instead think of this as having to hold onto a different type of information (spatial location as opposed to shading) in the uninformative condition, which is prospectively useful for making the necessary decision down the line.

      It is true that a major difference between the two conditions was the type of information that the participants had to retain, with a sensory-like spatial representation being required for the uninformative condition, and a more action-oriented (i.e., left or right finger) representation being required for the informative condition. To clarify, the participant never had to explicitly hold onto the shading (light or dark gray side of the disk), since the shading was always linked to a particular finger, and this mapping was known in advance at the start of each task run (although we did change this mapping across task runs within each participant to counterbalance the mapping of light/dark and the left/right finger – one mapping used in the first scanner session, the other mapping used in the second scanning session). We have clarified this sentence and we have removed the use of the terms “retrospective” and “prospective” as suggested in the previous comment. The sentence now reads: “This pattern is consistent with the use of a spatial code in the uninformative condition and a motor code in the informative condition.”

      Other points to consider:

      1. Opening with the Baddeley and Hitch 1974 reference when defining working memory implicitly implies buying into that particular (multi-compartmental) model. Though Baddeley and Hitch popularised the term, the term was used earlier in more neutral ways or in different models. It may be useful to add a recent more neutral review reference too?

      This is a nice suggestion. We have added a few more references to the beginning of the manuscript, which should together present a more neutral perspective (Atkinson & Shiffrin, 1968; Jonides, Lacey, & Nee, 2005).

      2. The body of literature showing attention-related selection/prioritisation in working memory linked to action preparation is also relevant to the current study. There's a nice review by Heuer, Ohl, Rolfs 2020 in Visual Cognition.

      We thank the reviewer for pointing out this interesting body of work, which is indeed very relevant here. We have added a new paragraph to our discussion which includes a discussion of this paper and its relation to our work.

    1. Author Responses

      Reviewer #3 (Public Review):

      Alexandre et al. fit a mathematical model of viral-host dynamics to previously published data from three SARS-CoV-2 challenge studies in non-human primates and identify immune markers that correlate with "protection" (as measured by viral loads) as well as or better than knowing whether an animal was naive, vaccinated, or recovered from natural infection. Crucially, the use of this model allowed for summarizing the complex time-dependent outcome data (viral sgRNA and gRNA loads over time) as a small number of more interpretable parameters (e.g., within-host viral infectivity, infected cell death rates, virion production rates) while allowing for intra-individual variation in a statistically rigorous fashion. Vaccine correlates of protection are notoriously difficult to identify and could be extremely valuable when assessing risks and designing vaccine dosages and booster schedules. The methodological approach developed in this paper is broadly applicable and a worthwhile contribution by itself. In the context of the particular data analyzed here, the statistically predictive immune markers showed reassuring consistency between the two studies using protein-based vaccines, although the third study using an mRNA-based vaccine differed. The conclusions have two limitations, the first of which is directly acknowledged by the authors while the second is not:

      1. The definition of "protection" is limited to the within-host cellular level. While within-host transmission is certainly related to between-host transmission and disease severity, many other factors play a role as well; this limitation is nicely acknowledged by the authors.

      2. The models may be overfit to the data, although this concern is somewhat tempered by the finding that application to the two protein-based vaccine studies yielded broadly similar results. Predictive statistical models of the type used here would ideally be tested on a held-out set of test data from the same type of experiment. The repeated use of BIC in a stepwise model selection framework with many predictors and limited biological replicates is risky.

      To address the reviewer’s comment about the repeated use of BIC as a model selection criterion in a stepwise selection procedure, we performed a small simulation to assess the robustness of BIC despite the multiplicity of tests. For each of the 18 NHPs, we simulated 25 longitudinal variables as white-noise random variables, with variances ranging from 1 to 10%. Figure 1 shows the results obtained after applying our algorithm with these variables as time-varying covariates. In the figure, the vertical black solid line represents the BIC value of the model without covariates, and the green dashed line that of the model with β and δ adjusted for groups. The results appear robust to the multiplicity of tests: all adjustments for white-noise variables yield similar BIC values and degrade the model compared with the model without covariates.
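The logic of the white-noise check can be illustrated with a deliberately simplified analog. The sketch below assumes ordinary least-squares regression on one hypothetical summary value per animal, not the nonlinear mixed-effects viral dynamics model with time-varying covariates described above; only the numbers 18 and 25 come from the text, and all function names are illustrative. It counts how often a pure-noise covariate lowers BIC relative to a model without covariates.

```python
import math
import random


def bic_gaussian(rss, n, k):
    """BIC of a Gaussian linear model, dropping constants: n*ln(RSS/n) + k*ln(n)."""
    return n * math.log(rss / n) + k * math.log(n)


def rss_intercept_only(y):
    """Residual sum of squares of the model without covariates (mean only)."""
    m = sum(y) / len(y)
    return sum((v - m) ** 2 for v in y)


def rss_simple_regression(x, y):
    """Residual sum of squares of y ~ a + b*x fitted by ordinary least squares."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    return syy - sxy ** 2 / sxx


def noise_covariate_screen(y, n_covariates=25, seed=0):
    """Fraction of white-noise covariates whose inclusion lowers BIC versus the null model."""
    rng = random.Random(seed)
    n = len(y)
    # Parameter counts include the residual variance; only their difference matters.
    bic_null = bic_gaussian(rss_intercept_only(y), n, k=2)
    improved = 0
    for _ in range(n_covariates):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # pure noise, unrelated to y
        bic_cov = bic_gaussian(rss_simple_regression(x, y), n, k=3)
        improved += bic_cov < bic_null
    return improved / n_covariates
```

Because BIC charges ln(n) per added parameter, a noise covariate lowers BIC in this setting only if it shrinks the residual sum of squares to less than n^(-1/n) of its previous value (about 0.85 for n = 18).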

      In addition, as mentioned in our response to comment 4b of Reviewer 1, we tested the robustness of the results using several selection criteria (AIC, BIC, LL, interindividual variability). All criteria led to similar results.

      To address this point in the manuscript, we created Appendix 2, “BICc as selection criteria and multiple testing adjustment”, in which we present this additional work. This appendix is referenced in the manuscript on page 31, line 666.

    1. Author Response:

      Reviewer #2:

      Weaknesses of the Methods and Results:

      1) In my view, the experiment does not allow one to unambiguously disentangle self vs. other distinction (as mentioned in the abstract "...we investigated how affect sharing and self-other distinction interact..."). For example, genuine vs. pretended pain could be distinguished from the participants' own experience in a comparable way. The higher rating of unpleasantness for genuine pain in others does not necessarily mean that the participants cannot separate their own experiences from those of others.

      We thank the reviewer for raising this issue and for prompting us to further clarify and better state our research purpose. In terms of its original theoretical foundation and motivation, the current study aimed to investigate whether and how neural signatures underlying two essential components of empathy, namely affect sharing and self-other distinction, track individual responses to genuine vs. pretended pain. We agree, though, that our experimental design does not allow us to unequivocally disentangle the precise aspects of self- and other-related processing in the two main conditions of interest (genuine pain or pretended pain). We thus modified any wording suggesting otherwise, so as to avoid further misunderstanding by readers.

      Accordingly, we have provided a more elaborate theoretical clarification in the Abstract and Introduction about our particular interest in studying self-other distinction and its neural correlates in the right supramarginal gyrus (rSMG) during empathy. We also mention as a potential limitation that our design did not aim to explicitly quantify self-other distinction.

      Action taken: In the manuscript, we have made the following changes:

      1) We modified the sentence "[...] we investigated how affect sharing and self-other distinction interact [...]" to

      “[...] we investigated how the brain network involved in affect sharing and self-other distinction underpinned [...] ” in the Abstract (P. 1).

      Besides, we modified another sentence “[...] to investigate the hypothesized distinct interactions between affective response and self-other distinction [...]” to

      “[...] to investigate the hypothesized brain patterns of affective responses and self-other distinction [...]” in the Introduction (P. 4).

      2) We added sentences in the Discussion (P. 13): “An additional limitation was that our study design did not aim to explicitly quantify self-other distinction. Rather, in line with previous research and based on our theoretical framework and rationale, we inferred the engagement of this process from the experimental conditions and the associated behavioral and neural responses. We expect our findings to prompt and inform future research designed to quantify and experimentally disentangle self- and other-related processes more explicitly.”

      2) The experimental design does not allow one to unambiguously disentangle genuine vs. pretended pain from other factors, such as the differences in pain expression, painful feeling in others, and higher unpleasantness in these two conditions. I understand that the intensity of pain expression, painful feeling in others, and unpleasantness for others is inherently tied to genuine vs. pretended pain. But the authors already saw that the instruction of "genuine vs. pretended" influenced the ratings of pain expression. Hence, the results allow two interpretations: either the influence from the anterior insula on the rSMG is driven by higher perceived pain expression, painful feeling in others, and unpleasantness, or by the conditions of genuine vs. pretended pain. Or (more likely) by an interaction between these factors. It would, for example, help to explore the association between the aIns-rSMG interaction and pain expression ratings (or painful feeling in others, or higher unpleasantness) in videos with genuine pain and pretended pain separately. The authors should further discuss how different factors (pain expression, etc.) contribute to the differences between genuine vs. pretended pain.

      We thank the reviewer for the thoughtful consideration of different factors that might contribute to the difference between genuine and pretended pain. One point we would like to address beforehand: disentangling the specific contributors underlying the manipulation is not the main focus of the current study, as 1) our primary aim was to study the effects of the experimental manipulation as a whole; we thus used the three behavioral ratings mainly to collect additional information on, and to interpret, the expected effects of the manipulation, and 2) these factors (and their behavioral measures) are inherently (cor)related and hard to disentangle precisely in any case, as mentioned by the reviewer and as shown by extensive previous research both by our and other groups.

      Action taken: Nonetheless, in the revised manuscript, we have now:

      1) discussed how different factors possibly interact and in this way contribute to the differences (in the modulatory effect) between genuine vs. pretended pain, in the Discussion (P. 11):

      “We speculate that a dynamic interaction between sensory-driven and control processes is underlying the modulatory effect: when individuals realized after an initial sensory-driven response to the facial expression that it was not genuinely expressing pain, control and appraisal processes led to a reappraisal of the triggered emotional response, and thus a dampening of the unpleasantness.”

      2) performed additional linear regression models and model comparison (see details in the response to comment #3) to investigate whether an interaction between behavioral measures could be a potential contributor to the modulatory effect of genuine pain and pretended pain; in short, the model without interactions is the winning model both for genuine pain and pretended pain.

      We have now discussed this result (P. 11):

      “Model comparison showed that the best model to explain the inhibitory effect with the behavioral ratings for both the genuine and pretended pain is the model without interactions between ratings. That is, if any behavioral rating contributed to the modulation of aIns to rSMG, the effect would be more likely coming from single ratings rather than their interactions. Specifically, we found [...]”

      We thank the reviewer for this suggestion for further analysis.

      We performed additional linear regression models (with and without interaction) and model comparison to explore whether any interaction between behavioral ratings heavily contributed to the modulatory effect. Results showed that the model without interaction was the most efficient model for both conditions.

      We report the additional analyses as follows:

      In the Methods section (P. 24-25): “Considering that interactions between behavioral ratings might contribute to the regression model, we tested five regression models (with and without interaction; see Supplementary Table 1) for both genuine pain and pretended pain. Results showed that for both genuine pain and pretend pain, the model without any interaction outperformed other models.”

      Supplementary Table 1. Model comparison of linear regression models with three behavioral ratings (independent variables) and the inhibitory effect (dependent variable) for genuine pain and pretended pain. Smaller AIC/BIC indicates better model fit. Results showed that M1 (without interaction; highlighted with underlining) was the best fitting model for both genuine pain and pretended pain.
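The comparison of regression models with and without interaction terms can be sketched as follows. The candidate set below is hypothetical (main effects only, each pairwise interaction, and all pairwise interactions), because the five models are specified in Supplementary Table 1 rather than in this text. AIC and BIC are computed for Gaussian linear models up to a constant shared by all candidates, and the small least-squares solver is written out only to keep the example dependency-free; function names are illustrative.

```python
import math
import random


def ols_rss(X, y):
    """Residual sum of squares of an OLS fit y ~ X (X includes an intercept column)."""
    p = len(X[0])
    # Augmented normal equations [X^T X | X^T y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * v for row, v in zip(X, y))] for i in range(p)]
    # Gaussian elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(c + 1, p):
            factor = A[r][c] / A[c][c]
            for k in range(c, p + 1):
                A[r][k] -= factor * A[c][k]
    # Back substitution for the coefficients.
    beta = [0.0] * p
    for i in reversed(range(p)):
        beta[i] = (A[i][p] - sum(A[i][j] * beta[j] for j in range(i + 1, p))) / A[i][i]
    fitted = [sum(b * xv for b, xv in zip(beta, row)) for row in X]
    return sum((v - f) ** 2 for v, f in zip(y, fitted))


def information_criteria(rss, n, n_coef):
    """(AIC, BIC) of a Gaussian linear model; k counts coefficients plus the residual variance."""
    k = n_coef + 1
    fit_term = n * math.log(rss / n)
    return fit_term + 2 * k, fit_term + k * math.log(n)


# Hypothetical candidate set; indices refer to (expression, feeling, unpleasantness).
CANDIDATE_MODELS = {
    "M1 main effects": [],
    "M2 + expression:feeling": [(0, 1)],
    "M3 + expression:unpleasantness": [(0, 2)],
    "M4 + feeling:unpleasantness": [(1, 2)],
    "M5 + all pairwise": [(0, 1), (0, 2), (1, 2)],
}


def design_matrix(ratings, interactions):
    """Rows of [1, expression, feeling, unpleasantness] plus the requested pairwise products."""
    return [[1.0, *r] + [r[i] * r[j] for i, j in interactions] for r in ratings]


def compare_models(ratings, effect):
    """Map each candidate model name to its (AIC, BIC); smaller values indicate better fit."""
    n = len(effect)
    results = {}
    for name, interactions in CANDIDATE_MODELS.items():
        X = design_matrix(ratings, interactions)
        results[name] = information_criteria(ols_rss(X, effect), n, len(X[0]))
    return results


if __name__ == "__main__":
    rng = random.Random(0)
    # Simulated ratings and modulatory effects, for illustration only.
    ratings = [tuple(rng.gauss(0.0, 1.0) for _ in range(3)) for _ in range(43)]
    effect = [0.3 * f + rng.gauss(0.0, 1.0) for _, f, _ in ratings]
    for name, (aic, bic) in compare_models(ratings, effect).items():
        print(f"{name}: AIC={aic:.2f} BIC={bic:.2f}")
```

Adding interaction terms can only lower the residual sum of squares, so the penalty terms are what allow a main-effects model to win such a comparison.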

      Accordingly, we now report the results of the winning model of the multiple regression analyses, instead of the original stepwise regression. These analyses found that only the rating of painful feelings in others was significant for genuine pain, while no significant effects whatsoever were found for pretended pain.

      Action taken:

      In the manuscript, we have made the following changes:

      1) We modified “stepwise linear regression” to “multiple linear regression” in the Methods, Results, and Figure 3 legend (P. 24, P. 7, and P. 37)

      2) We added the sentence “The results of the winning multiple regression model are reported in the Results section.” in the Methods (P. 25).

      3) We added the results of the multiple regression analyses for genuine pain and pretended pain, in the Results section (P. 7-8): “For the genuine pain condition, we find that the modulatory effect was significantly related to the rating of painful feelings in others (t = 2.317, p = 0.026) but not related to the rating of either painful expressions in others (t = -1.492, p = 0.144) or unpleasantness in self (t = 0.058, p = 0.954). For the pretended pain condition, none of the ratings was significantly related to the modulatory effect (Figure 3D).”

      4) We moved the results of the original stepwise regression analyses with behavioral ratings into the supplementary data (see Supplementary File 2):

      “Results of the stepwise regression analyses on modulatory effects and behavioral ratings are shown below. Note that this analysis reflects our original analysis approach; prompted by a reviewer comment, we however changed the analysis plan and performed and reported the findings of multiple regression analyses in the main text. Importantly, the conclusions of the two analysis approaches are consistent.

      To examine how the modulatory effects from the DCM were related to the behavioral ratings, we computed two stepwise linear regression models for each condition. The regression model was significant for the genuine pain condition (F(1, 41) = 4.639, p = 0.037, R² = 0.104), when painful feelings in others were added to the model and the other two ratings were excluded (B = 0.079, beta = 0.322, p = 0.037). However, the model was not significant for the pretended pain condition. The variance inflation factors (VIFs) for the three ratings in both models were calculated to diagnose collinearity, showing no severe collinearity problem (all VIFs < 5; the smallest VIF = 1.132 and the largest VIF = 4.387).”
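The collinearity diagnostic mentioned above can be illustrated for the three-rating case. The sketch below uses invented function names and is not the software used in the analysis: it relies on the identity VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing rating j on the other two, which for three predictors reduces to a closed form in the pairwise correlations.

```python
import math


def pearson_r(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    s_aa = sum((x - mean_a) ** 2 for x in a)
    s_bb = sum((y - mean_b) ** 2 for y in b)
    return s_ab / math.sqrt(s_aa * s_bb)


def vifs_three_predictors(x1, x2, x3):
    """Variance inflation factors for exactly three predictors.

    VIF_j = 1 / (1 - R_j^2) equals the j-th diagonal entry of the inverse correlation
    matrix; for a 3 x 3 matrix that entry is a cofactor divided by the determinant.
    """
    r12, r13, r23 = pearson_r(x1, x2), pearson_r(x1, x3), pearson_r(x2, x3)
    # Determinant of the 3 x 3 correlation matrix.
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    return ((1 - r23**2) / det, (1 - r13**2) / det, (1 - r12**2) / det)
```

Uncorrelated ratings give VIFs of exactly 1, and the values grow as any rating becomes predictable from the other two.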

      3) The multiple regression analyses revealed an association between the unpleasantness for the participants and the aIns, when accounting for the painful expression and the pain experienced by the other. This, however, does not reveal the specificity of the aIns for encoding the unpleasantness for the participants. It might well be that variance is shared in the association between the aIns and pain expression and pain by the other and unpleasantness for the participants, but simply strongest for unpleasantness. Such ambiguity could be resolved by additional multiple regressions of 1) pain expression (controlling for pain by the other and unpleasantness for the participants) and 2) pain for the other (controlling for pain expression and unpleasantness for the participants).

      We thank the reviewer for this comment. As an overall premise, please note that we do not claim that the aIns is specifically engaged in encoding affective states without any engagement of other processes; we are well aware that aIns activation participates in a variety of affective and cognitive processes. Nonetheless, our original multiple regression models were performed as a second-level group analysis with all three ratings as independent variables. Only the rating of “unpleasantness in self” was a significant predictor; had aIns activation been driven by domain-general factors that influence all ratings alike, all three ratings would have been expected to explain it.

      As the reviewer suggested, we additionally performed five multiple regression analyses covering all other possible orders of the three behavioral measures, to test whether the order matters. We found consistent results across all six regression analyses, suggesting that the selective correlation between aIns activation and the rating of unpleasantness in self is robust.
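      When all three ratings enter the model simultaneously, ordinary least squares estimates do not depend on the order in which the predictors are listed, which is consistent with identical results across the six orderings (order would matter for the R² increments of a hierarchical or stepwise model). A minimal NumPy sketch with simulated, hypothetical data; the variable names are illustrative:

```python
import itertools

import numpy as np


def fit_simultaneous(y, predictors):
    """OLS fit of y on an intercept plus all predictors at once; returns {name: slope}."""
    names = list(predictors)
    design = np.column_stack([np.ones(len(y))] + [predictors[name] for name in names])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return dict(zip(names, beta[1:]))


rng = np.random.default_rng(1)
n = 40
ratings = {  # hypothetical per-participant condition-difference scores
    "expression": rng.normal(size=n),
    "feeling": rng.normal(size=n),
    "unpleasantness": rng.normal(size=n),
}
ains = 0.8 * ratings["unpleasantness"] + rng.normal(scale=0.5, size=n)

# Fit the model once for each of the 3! = 6 orders of entry.
fits = [
    fit_simultaneous(ains, {name: ratings[name] for name in order})
    for order in itertools.permutations(ratings)
]
for name in ratings:
    print(name, [round(fit[name], 6) for fit in fits])  # identical across orders
```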

      Action taken: In the manuscript, we have:

      1) Modified “specifically” to “selectively” in the Results (P. 6).

      2) Added the following to the Methods (P. 22): “To test whether the order of entering ratings into the regression model influences the results, we performed five additional regression analyses with all possible orders of the three ratings. The results were consistent across all six regression models, and we only show the result for one regression (i.e., expression + feeling + unpleasantness) in the Results section.”

      3) Modified the sentence in the Results (P. 6) “We found significant clusters in bilateral aIns, visual cortex, and cerebellum (Figure 2B); notably, when statistically accounting for ratings of painful expressions in others and painful feelings in others, all three clusters were exclusively explained by the ratings of self-unpleasantness.” to

      “We found significant clusters in bilateral aIns, visual cortex, and cerebellum that could be selectively explained by the ratings of self-unpleasantness and could not be explained by either the ratings of painful expressions in others or painful feelings in others (Figure 2B).”

      4) Modified the sentence in the Discussion (P. 9) “[...] but the increased activation in aIns was also selectively correlated with ratings of self-oriented unpleasantness (i.e., after statistically accounting for painful expressions and painful feelings in others) [...]” to

      “[...] but the increased activation in aIns was also selectively correlated with ratings of self-oriented unpleasantness and was correlated with neither other-related painful expressions nor painful feelings in terms of the regression analysis [...]”

      and added the sentence “[...] (otherwise the increased aIns activation should also be explained by other behavioral ratings in the sense of shared influence by domain-general effects).”

      5) Modified the legend for Figure 3 (P. 37) “[...] revealed a positive correlation between the inhibitory effect and painful feelings in others (after accounting for the other two ratings) for genuine pain [...]” to

      “[...] revealed a positive correlation between the inhibitory effect and painful feelings in others, and not with the other two ratings, for genuine pain [...]”

      4) Is the regression biased by the differences between conditions in the aIns in both fMRI signals and the ratings?

      We thank the reviewer for this comment. We compared differences between conditions mainly to control for potential effects of perceptual salience, and we did so for both fMRI signals and behavioral ratings. Note that, because aIns activation and all behavioral ratings were higher for genuine than for pretended pain, the current result cannot be explained by an inverse effect (i.e., higher aIns activation and higher ratings of unpleasantness in self for pretended pain). Therefore, we do not consider it problematic to use differences between conditions in the multiple regression analysis.

      Action taken:

      1) In the Methods (P. 22), we now explicitly specify that the “differences between conditions” for the three behavioral ratings served as independent variables in the multiple regression model.

      2) We added the sentence “The reason that we used the comparison between conditions for both brain signals and behavioral ratings was to control for potential effects of perceptual salience.” to the Methods (P. 22).

      5) The inclusion of the rSMG into the DCM model is not straightforward for me. It could have been based on previous literature, but then the aMCC should have been added as well. Furthermore, while the implication of the rSMG in distinction of self vs. others is established, the actual process in this experiment cannot be revealed. The authors state that the rSMG is involved in action observation or imitating emotions (page 9, line 200).

      We appreciate the reviewer’s comment, which shows that we did not convey clearly why we postulated a role for rSMG. We have now made our rationale more explicit.

      Action taken:

      We have now:

      1) Modified the clarification of rSMG in the Discussion (P. 10): “The inferior parietal lobule was shown to be generally engaged in selective attention, action observation and imitating emotions (Bach et al., 2010; Pokorny et al., 2015; Gola et al., 2017; Hawco et al., 2017). Importantly, a specific role in affective rather than cognitive self-other distinction has been consistently identified for rSMG (Silani et al., 2013; Steinbeis et al., 2015; Bukowski et al., 2020). [...]”

      2) Added the following clarification in the Discussion (P. 12) after the sentence “[...] the correlation findings provide further evidence that the modulation of aIns to rSMG is implicated in encoding others’ emotional states,”: “which serves as a functional foundation for self-other processing [...] This regulation cannot be totally attributed to domain-general processes, otherwise other ratings should have also explained this variation.”

      Additionally, we agree regarding the aMCC, which we also predicted to play a role; however, this was not the case, at least in our data. In fact, we had already addressed this in the original version of the manuscript (maintained on P. 7 of the revised manuscript): “Our original analysis plan was to include aMCC in the DCM analyses, but based on the fact that aMCC did not show as strong evidence (in terms of the multiple regression analysis) as the aIns of being involved in our task, we decided to use a more parsimonious DCM model without the aMCC.”

      Whether Results support their conclusions:

      The results support the distinction between the experimental conditions of genuine vs. pretended pain in the aIns and as a modulatory influence on the connectivity between the aIns and the rSMG. However, the authors aimed to test if genuine vs. pretended pain modulate regulatory influences from the aIns on the rSMG that are connected to self-other distinction (as proposed in the discussion page 8, line 170). Yet, any insights about self-other distinction are only inferred reversely, since there is no outcome that indicates how well participants distinguished between themselves and the other person. For example in the discussion the authors state that: " we thus propose that the higher rSMG engagement in genuine pain conditions reflects an increasing demand for self-other distinction imposed by the stronger shared negative affect experiences in this condition". This is not supported by the results. Furthermore, the title mentioned automated responses to pretended pain, which I could not understand, given the current results.

      We thank the reviewer for this comment, which follows up on similar arguments raised and addressed in comment #1 above. Indeed, we fully agree that our design did not allow us to quantify self-other distinction, and that we inferred its engagement from a strong theoretical motivation and from the replication of previous findings on rSMG involvement during self-other distinction. As outlined above (cf. #1), this limitation has been added to the revised manuscript.

      We also restructured the reasoning in the Discussion (P. 10), placing the theoretical explanation ahead of the inference so that readers can better see that this statement rests on a strong theoretical motivation:

      First “Theoretical models of empathy [...]” and then “Concerning the current finding, we thus propose that [...]”

      We thank the reviewer for pointing out the potential ambiguity in the title. We agree it may be somewhat “imprecise”, and have revised the title accordingly (P. 1):

      “Neural dynamics between anterior insular cortex and right supramarginal gyrus dissociate genuine affect sharing from perceptual saliency of pretended pain”.

      Likely impact of the work on the field:

      These results are expected to advance the field, since they allow visual expressions of pain to be disentangled from genuine pain in others. Thereby, this work could resolve the question about neural processes that are specific to pain in others beyond other salient cues.

      We thank the reviewer for this positive acclaim of our study.

    1. Author Response

      Reviewer #1 (Public Review):

      This paper by Zhuang and colleagues seeks to answer an important clinical question by trying to come up with novel predictive biomarkers to predict high-risk T1 colorectal cancers that are at risk for nodal involvement. The current clinical features may both miss patients who underwent local therapy but should have gone on to surgery, and select patients for surgery on the basis of risk features, perhaps unnecessarily. Using a training and validation set, they developed a protein-based classifier with an AUC of 0.825 based on mass spectrometry and proteomic analyses of patients with and without LN metastasis, importantly linking biological rationale to the proteomic discoveries.

      In the training cohort, they reduced 105 candidate proteins to 55 and validated the classifier first in the training cohort and then in two validation cohorts (one of which was prospective). They also examined a 9-protein classifier, which also performed well, and furthermore examined IHC for clinical ease.

      We thank the reviewer for the positive review and valuable comments. We have revised the manuscript according to the comments.

      Reviewer #2 (Public Review):

      The authors utilized a label-free LC-MS/MS analysis in formalin-fixed paraffin-embedded (FFPE) tumors from 143 LNM-negative and 78 LNM-positive patients with T1 CRC to identify protein biomarkers to determine LNM in T1 CRC.

      The authors used a fair number of clinical samples for the proteomics investigation. The experimental design is reasonable, and the statistical methods used in this manuscript are solid.

      The authors largely achieved their aims and the results supported their conclusion. The method used in this proteomic study can also be used for the proteomics analysis of other cancer types to identify diagnostic and prognostic biomarkers. In addition, the 9-marker panel has potential for clinical use in determining LNM in T1 CRC.

      Nevertheless, the authors need to justify their standards in selecting the biomarkers. For example, a p-value cut-off of 0.1 is not a usual criterion in similar proteomic studies. In addition, an identification frequency of 30% in patients seems not preferable for biomarker identification. The authors also need to justify the definition of fold change in the three subtypes with the Kruskal-Wallis test. The authors need to describe more details on how they identified the 13 proteins from a 55-protein database. In addition, what is the connection between the final 9 proteins and the 19 proteins? What is the criterion to select 5 proteins for IHC validation from the 9 proteins?

      We thank the reviewer for the positive review and valuable comments. We have revised the manuscript according to the comments.

      Our selection criteria and their details are as follows.

      1) About p-value cut-off of 0.1:

      The purpose of this step is to screen appropriate variables for subsequent machine learning, rather than to test differences between groups. A p-value cut-off of 0.1 is also a commonly used strategy for variable selection in proteomics research. For example, it has been used in a study predicting the response to tumor necrosis factor-α inhibitors in rheumatoid arthritis (PMID: 28650254), a study of the circadian clock in mouse liver (PMID: 29674717), proteomic biomarker discovery in atherosclerosis (PMID: 15496433), and a proteomics and transcriptomics analysis of Bacillus subtilis (PMID: 19948795).
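      The screening step can be sketched as a univariate Wilcoxon rank-sum (Mann-Whitney) test per protein followed by a p-value cutoff; a more permissive cutoff (0.1) keeps borderline proteins that a 0.05 cutoff would drop. This is a minimal pure-Python sketch under stated assumptions: it uses the large-sample normal approximation without tie or continuity correction (so p-values differ slightly from exact or corrected implementations), and the protein names and intensities are hypothetical.

```python
import math


def rank_sum_p(group_a, group_b):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, average ranks for ties)."""
    values = list(group_a) + list(group_b)
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1  # 1-based average rank of the tied block
        for pos in range(start, end + 1):
            ranks[order[pos]] = average_rank
        start = end + 1
    n1, n2 = len(group_a), len(group_b)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2))


def screen(proteins, alpha):
    """Keep proteins whose LNM-negative vs LNM-positive rank-sum p-value is below alpha."""
    return sorted(name for name, (neg, pos) in proteins.items() if rank_sum_p(neg, pos) < alpha)


proteins = {  # hypothetical intensities: (LNM-negative samples, LNM-positive samples)
    "P_separated": ([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]),
    "P_borderline": ([1, 2, 3, 4, 5, 6, 7, 16, 18, 19], [8, 9, 10, 11, 12, 13, 14, 15, 17, 20]),
    "P_interleaved": ([1, 3, 5, 7, 9], [2, 4, 6, 8, 10]),
}
print(screen(proteins, alpha=0.1))   # ['P_borderline', 'P_separated']
print(screen(proteins, alpha=0.05))  # ['P_separated']
```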

      Following the reviewer’s suggestion, we also screened variables with a p-value cutoff of 0.05. In a training set of 70 lymph node-negative and 62 lymph node-positive cases, this identified 355 candidate proteins. We incorporated these proteins into a lasso regression analysis, which yielded a lymph node metastasis prediction model consisting of 52 protein markers. This model achieved AUC values of 1.000, 0.824, and 0.918 in the training set, VC1, and VC2, respectively; its predictive performance was slightly inferior to that of the model developed in this study (Figure 3- figure supplement 1C).
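      The response does not state the software or penalty-selection procedure used for the lasso step, so the following is only a hypothetical NumPy sketch of L1-penalized (lasso) logistic regression fitted by proximal gradient descent, with a fixed penalty `lam` chosen for illustration rather than by cross-validation. It shows the property that makes lasso useful for building a panel: coefficients of uninformative features are driven to (or very near) zero, so the features with nonzero weights form the retained markers.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of t * ||v||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def lasso_logistic(X, y, lam, n_iter=5000):
    """Minimise mean logistic loss + lam * ||w||_1 by proximal gradient descent (ISTA).

    The intercept (theta[0]) is not penalised. Returns (intercept, weights).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])
    # Step 1/L, where L = ||design||_2^2 / (4n) bounds the curvature of the mean log-loss.
    step = 4.0 * n / np.linalg.norm(design, 2) ** 2
    theta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(design @ theta)))
        theta = theta - step * (design.T @ (prob - y)) / n
        theta[1:] = soft_threshold(theta[1:], step * lam)
    return theta[0], theta[1:]


rng = np.random.default_rng(0)
n, k = 400, 10
X = rng.normal(size=(n, k))  # hypothetical standardised protein intensities
true_w = np.zeros(k)
true_w[:2] = [2.0, -2.0]     # only the first two features carry signal
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ true_w)))).astype(float)

intercept, w = lasso_logistic(X, y, lam=0.1)
print("selected features:", np.flatnonzero(w))
```

      Larger values of `lam` yield sparser panels; in practice the penalty is usually tuned by cross-validation in the training cohort only, before any validation cohort is touched.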

      2) About identification frequency of 30%:

      Restricting the analysis to proteins identified in >30% of samples has been applied in previously published studies. For instance, a study using proteomic biomarkers to build a diagnostic model for lung cancer (PMID: 29576497) used proteins identified in >30% of cohort samples for downstream analysis, and a study on the impact of Reptin on protein-protein interactions (PMID: 30862565) required proteins to be detected in >30% of samples to be included in the proteome dataset.

      We compared our cohort with the cohorts of Jun Qin et al. and Bing Zhang et al. (study published in Nature; PMID: 25043054) in terms of the number of proteins detected in at least 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, and 100% of samples (Figure 2- figure supplement 1). The proportions of proteins detected at each cutoff in the three cohorts were: 10% (0.60, 0.94, 0.48), 20% (0.52, 0.83, 0.38), 30% (0.46, 0.75, 0.31), 40% (0.41, 0.69, 0.26), 50% (0.37, 0.63, 0.23), 60% (0.33, 0.57, 0.18), 70% (0.29, 0.52, 0.15), 80% (0.25, 0.45, 0.11), 90% (0.19, 0.37, 0.11), and 100% (0.07, 0.23, 0.10). The results showed that our cohort was reliable.
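      Both the identification-frequency filter and the cohort comparison rest on one quantity: the fraction of samples in which each protein is quantified. A minimal pure-Python sketch, assuming undetected values are stored as `None` and that "detected in X% of samples" means at least X% (the boundary convention is an assumption); the protein names and values are hypothetical.

```python
from typing import Dict, List, Optional

Matrix = Dict[str, List[Optional[float]]]  # protein -> intensity per sample (None = not detected)


def detection_rates(matrix: Matrix) -> Dict[str, float]:
    """Fraction of samples in which each protein was quantified."""
    return {protein: sum(v is not None for v in values) / len(values)
            for protein, values in matrix.items()}


def keep_frequent(matrix: Matrix, min_fraction: float) -> List[str]:
    """Proteins detected in at least min_fraction of samples."""
    return [p for p, rate in detection_rates(matrix).items() if rate >= min_fraction]


def retained_proportions(matrix: Matrix, cutoffs: List[float]) -> Dict[float, float]:
    """Proportion of all proteins that survive each detection-frequency cutoff."""
    return {c: len(keep_frequent(matrix, c)) / len(matrix) for c in cutoffs}


matrix: Matrix = {
    "P1": [5.1, 4.8, 5.3, 5.0],     # detected in 4/4 samples
    "P2": [2.2, None, None, None],  # detected in 1/4 samples
    "P3": [3.0, 3.4, None, 2.9],    # detected in 3/4 samples
}
print(keep_frequent(matrix, 0.3))  # ['P1', 'P3']
print(retained_proportions(matrix, [0.1, 0.5, 1.0]))
```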

      To investigate the impact of the protein identification-frequency cutoff in our study, we performed comparative pathway enrichment analysis of the differentially expressed proteins (LNM+ vs. LNM-: p < 0.05, Wilcoxon rank-sum test) at three detection-frequency cutoffs: proteins detected in more than 10%, more than 30%, and more than 50% of samples. Proteins from all three thresholds showed similar pathway enrichment: mTOR signaling and amino acid metabolism pathways were dominant in LNM-negative patients, whereas coagulation cascades and lipid metabolism pathways were overrepresented in LNM-positive patients (Figure 2- figure supplement 1).

      Following the reviewer’s suggestion, we also tested a 50% identification-frequency cutoff for variable selection. Lasso regression in the training cohort (70 LNM-negative and 62 LNM-positive patients) yielded an AUC of 0.999, and the model was validated in VC1 and VC2 with AUCs of 0.812 and 0.886, respectively (Figure 2- figure supplement 1).

      3) About identification of the 13 proteins and the criterion to select 5 proteins for IHC validation from 55-protein database:

      The process of reducing the number of proteins from 55 to 13 and finally establishing a 5-protein classifier based on the IHC score is shown in Figure 1- figure supplement 2 in the revision. From the 55 proteins, we first selected the 19 proteins with |log2FC| > 1 and p < 0.05 (Wilcoxon rank-sum test) between LNM-negative and LNM-positive patients among all 221 patients. We then sought antibodies against these 19 proteins and obtained antibodies for 13 of them. We performed immunohistochemical staining of the FFPE samples with these 13 antibodies and used the IHC score of each protein to build a single-protein prediction model by ROC curve analysis in SPSS. Because MS-based proteomics and IHC staining rely on different principles, not all proteins identified by MS can be translated into IHC assays. Finally, the 5 IHC markers whose IHC scores reached p < 0.05 (Student’s t-test) were selected to build the IHC classifier using logistic regression. We also updated the description in the “Result” section of the revised manuscript (lines 718-722, pages 34-35 in the revision).
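      The first filtering step (|log2FC| > 1 and p < 0.05) is a simple rule over per-protein summary statistics. A minimal sketch, assuming log2FC is computed as log2 of the ratio of mean intensities (LNM-positive over LNM-negative) and that the Wilcoxon p-values have already been computed; the protein names and numbers are hypothetical.

```python
import math
from statistics import mean


def log2_fold_change(positive, negative):
    """log2 of mean intensity in LNM-positive over mean intensity in LNM-negative samples."""
    return math.log2(mean(positive) / mean(negative))


def select_candidates(stats, fc_cutoff=1.0, p_cutoff=0.05):
    """Keep proteins with |log2FC| > fc_cutoff and p < p_cutoff.

    stats maps protein -> (positive intensities, negative intensities, p-value).
    """
    return [
        protein
        for protein, (pos, neg, p) in stats.items()
        if abs(log2_fold_change(pos, neg)) > fc_cutoff and p < p_cutoff
    ]


stats = {
    "UP": ([8.0, 8.4, 7.6], [2.0, 2.2, 1.8], 0.01),     # about 4-fold up, significant
    "DOWN": ([1.0, 1.1, 0.9], [4.0, 4.2, 3.8], 0.02),   # about 4-fold down, significant
    "SMALL": ([2.2, 2.0, 2.4], [2.0, 2.0, 2.0], 0.03),  # significant but under 2-fold
    "NS": ([8.0, 8.0, 8.0], [2.0, 2.0, 2.0], 0.20),     # large change, not significant
}
print(select_candidates(stats))  # ['UP', 'DOWN']
```

      The absolute value makes the rule symmetric: a protein passes whether it is at least 2-fold higher or at least 2-fold lower in LNM-positive tumors.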

      4) About the connection between the final 9 proteins and the 19 proteins:

      To facilitate clinical translation of the model, multiple logistic regression was used to select 9 core proteins from the 19 proteins (Figure 1- figure supplement 2 in the revision). We first fitted a logistic regression model with all 19 proteins, eliminated the 10 proteins whose coefficients were not significant (Wald test, Pr(>|z|) > 0.05), and retained the 9 proteins with Pr(>|z|) < 0.05. We then performed binary logistic regression again with these 9 proteins to build the simplified classifier. We also updated the description in the “Materials and methods” section of the revised manuscript (line 1092, page 51 in the revision).
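      The reduction from 19 to 9 proteins keeps the predictors whose Wald test in a multiple logistic regression gives Pr(>|z|) < 0.05 and then refits the model on them. A minimal NumPy sketch, assuming a single elimination pass as described (not iterative stepwise selection) and Newton-Raphson fitting; the data are simulated and hypothetical. The Wald z is the coefficient divided by its standard error, which is the quantity reported as "z value" in R's `glm` summary.

```python
import math

import numpy as np


def fit_logistic(X, y, n_iter=50):
    """Maximum-likelihood logistic regression via Newton-Raphson.

    Returns coefficients (intercept first) and two-sided Wald p-values.
    """
    design = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(design @ beta)))
        weights = prob * (1.0 - prob)
        hessian = design.T @ (design * weights[:, None])
        beta = beta + np.linalg.solve(hessian, design.T @ (y - prob))
    prob = 1.0 / (1.0 + np.exp(-(design @ beta)))
    cov = np.linalg.inv(design.T @ (design * (prob * (1.0 - prob))[:, None]))
    z = beta / np.sqrt(np.diag(cov))
    p_values = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    return beta, p_values


def eliminate_and_refit(X, y, alpha=0.05):
    """Fit all predictors, keep those with Wald p < alpha, then refit on the kept set."""
    _, p_values = fit_logistic(X, y)
    kept = [j for j in range(X.shape[1]) if p_values[j + 1] < alpha]  # skip intercept
    beta, _ = fit_logistic(X[:, kept], y)
    return kept, beta


rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 6))            # hypothetical protein features
logit = 1.5 * X[:, 0] - 1.5 * X[:, 1]  # only features 0 and 1 are informative
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

kept, beta = eliminate_and_refit(X, y)
print("kept features:", kept)
```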

      5) About the definition of fold change in the three subtypes with the Kruskal-Wallis test:

      For each of the three subtypes (well to moderately differentiated adenocarcinoma, poorly differentiated adenocarcinoma and mucinous adenocarcinoma), the fold change is the ratio of the mean expression in that subtype to the mean expression in the other two subtypes. The Kruskal-Wallis test was performed across the three subtypes.

      We also updated the description in the “Result” section in the revised manuscript (line 506-517, page 25 in the revision), and “Figure 1- figure supplement 2H in the revision”.
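      The subtype fold-change definition can be made concrete as below. This sketch assumes that "the mean of the other two groups" means the mean over all samples pooled from the other two subtypes; the alternative reading (the average of the two group means) gives a different value when group sizes are unequal. The Kruskal-Wallis test is omitted, and the subtype labels and values are hypothetical.

```python
from statistics import mean


def subtype_fold_changes(expression_by_subtype):
    """For each subtype: mean in that subtype / mean over the pooled samples of the other subtypes."""
    result = {}
    for subtype, values in expression_by_subtype.items():
        others = [
            v
            for other, vals in expression_by_subtype.items()
            if other != subtype
            for v in vals
        ]
        result[subtype] = mean(values) / mean(others)
    return result


expression = {  # hypothetical intensities of one protein
    "well_moderate": [2.0, 2.0, 2.0, 2.0],
    "poor": [4.0, 4.0],
    "mucinous": [8.0, 8.0],
}
print(subtype_fold_changes(expression))
```

      For "poor", the pooled mean of the other samples is 4.0 (fold change 1.0), whereas the average of the two other group means would be 5.0 (fold change 0.8), which is why the pooling convention should be stated explicitly.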

      Reviewer #3 (Public Review):

      This work provides a proteomic analysis of 132 early-stage (pT1) colorectal cancers (CRC) to attempt to identify proteins (or a signature pattern thereof) that might be used to predict the patient risk of lymph node metastases (LNM) and potentially stratify patients for further treatment or surveillance. The generated dataset is extensive and the methods appear solid. The work identifies a 55-protein signature that is strongly predictive of LNM in the training cohort and two validation cohorts and then generates two simplified classifiers: a 9-protein proteomic and a 5-protein immunohistochemical classifier. These also perform very well in predicting LNM. Loss of the small GTPase RHOT2 is identified as a poor prognostic factor and validated in a migration assay. The findings could allow better prognostication in CRC and, if confirmed and better validated and contextualized, might impact patient care.

      Strengths:

      A large training cohort of resected early-stage (pT1M0) CRCs was analyzed by rigorous methods including careful quantitative analysis. The data generated are unbiased and potentially useful. A number of proteins are found to be different between CRCs with and without lymph node metastases, which are used to train a machine learning model that performs flawlessly in predicting LNM in the training cohort and very well in predicting LNM in two validation cohorts. The authors then develop two simplified classifiers that might be more readily extended into clinical care: a 9-protein proteomic assay and a 5-protein immunohistochemical assay; both of these also perform well in predicting LNM. Because LNM is a key prognostic factor, and colectomy (which includes removal of lymph nodes needed to assess LNM) carries significant risk and morbidity, particularly in rectal cancer, classifiers like these are potentially interesting. Finally, the authors identify the loss of expression of RHOT2 as a novel prognostic factor.

      Weaknesses:

      Major points:

      The data are limited by a number of assumptions about metastasis, minimal contextualization of the results, and claims that are too strong given the data. Critically, the authors use the presence or absence of LNM as the study's only outcome; while LNM is a key predictor in CRC, it is uncommon in T1 CRC (generally 3-10%, 12% in this study), stochastic, inefficient, and incompletely identified by histologic evaluation. Larger resection (here, colectomy) removes both identified and occult LNM, which is probably best studied in randomized trials of lymphadenectomy in Japanese gastric cancer cohorts and should be better discussed. Critically, patient survival or disease-free survival would be more relevant outcomes. Further, absent longer-term data, many patients without identified LNM might nonetheless be high-risk and skew the cohorts. It is also not clear whether these findings would be generalizable to other early-stage colon cancers.

      The data are also not correlated with the genetics of the cases, which were not discussed.

      The results would benefit from the inclusion of standard-of-care MSI status. The classifiers would also be much more impactful if they were generalizable beyond T1 CRCs; this could be readily tested in public datasets.

      The authors explain the data as mechanistic, but, aside from one experiment modulating RHOT2 levels, they are fundamentally correlative and should be described as such.

      Although they focused on areas containing >80% tumor as judged by the reading pathologist, it is unclear whether the identified proteomic changes originate from the tumor or the microenvironment.

      The authors fail to properly contextualize the results and overstate the novelty of their study. A number of examples - the study is claimed as "the first proteomic study of T1 CRC" and "the first comprehensive proteomics study to focus on LNM in patients with submucosal T1 CRCs"; neither of these appears to be true, for example, Steffen et al. (Journal of Proteome Research, 2021, reference 18) may satisfy both of these, although the numbers are smaller. Many other results are reported without context, for example, proteomic characterization of mucinous carcinomas has been performed previously, a modest correlation in mucinous carcinoma is ascribed a large mechanistic role, and PDPN is discussed but is not contextualized as a protein that has been well-studied in the context of metastasis.

      The data on RHOT2 are promising but very preliminary. RHOT2 is described as ubiquitous in colorectal cancer cell lines; a brief search in Human Protein Atlas shows RHOT2 RNA and proteins are ubiquitously expressed throughout the body. While its loss appears potentially prognostic, it is unclear whether this is simply a surrogate for other features, such as loss of differentiation state, and whether this is unique to CRC; multivariate analysis would be important.

      We thank the reviewer for the constructive and insightful comments, which helped to improve the quality of this manuscript. We summarize the reviewer’s comments as follows: (1) lack of longer-term data and micrometastasis; (2) testing the classifier in public datasets; (3) inclusion of standard genetics and gene alterations; (4) the tumor purity of all tumor samples and whether the results were influenced by the tumor microenvironment; (5) contextualization of the results; (6) multivariate analysis of RHOT2.

      1) Lack of longer-term data and micrometastasis:

      We thank the reviewer for these comments. We fully acknowledge the limitations of our study, including the uncertainty associated with the detection of lymph node micrometastasis and the lack of long-term survival data, which can impact the strength of our conclusions. We agree that LNM is a key predictor in CRC and that it is uncommon in T1 CRC, with a reported incidence of 3-10%. We acknowledge that larger resections, such as colectomy, are generally recommended for patients with T1 CRC with LNM due to the potential risk of metastasis. However, our study aimed to establish a predictive model for LNM in T1 CRC, which could potentially help guide clinical decision-making on whether additional surgery is needed after endoscopic resection, according to the current NCCN guidelines.

      We have taken the following steps to address these limitations:

      • We used propensity-score matching to reduce confounding biases in our training cohort, and our validation cohort was designed as a single-blinded prospective study with prospectively enrolled patients, to enhance the rigor and reliability of our findings.

      • Regarding the influence of micrometastases in our study: following the reviewer’s suggestion, we discussed reports on lymph node micrometastases in Japanese gastric cancer cohorts (PMID: 17377930, 9070482) and also consulted articles on micrometastases in T1 CRC (PMID: 17661146, 16412600). About 5% of pT1N0 gastric cancer patients have isolated tumor cells (ITCs) in LN, as do about 10% of pT1Nx CRC patients. The effect of micrometastases (MMs) on prognosis in pT1N0 CRC is still unclear. The presence of ITCs/MMs in LN may explain why 29 LNM-negative patients (about 13% of all 221 patients) were classified into the high-risk group by the prediction model in our study.
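      The response does not describe the matching algorithm itself. Purely as a hypothetical illustration, greedy 1:1 nearest-neighbour matching without replacement on precomputed propensity scores (usually estimated by a logistic regression of group membership on clinical covariates) with a caliper could look like this; the IDs, scores, processing order, and the 0.05 caliper are all assumptions.

```python
def greedy_caliper_match(cases, controls, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching on propensity score, without replacement.

    cases, controls: dicts mapping patient ID -> propensity score.
    Cases are processed from highest to lowest score (one common convention);
    a pair is formed only if the closest unused control lies within the caliper.
    """
    available = dict(controls)
    pairs = []
    for case_id, score in sorted(cases.items(), key=lambda item: item[1], reverse=True):
        if not available:
            break
        best = min(available, key=lambda cid: abs(available[cid] - score))
        if abs(available[best] - score) <= caliper:
            pairs.append((case_id, best))
            del available[best]  # without replacement: each control is used once
    return pairs


cases = {"case_1": 0.80, "case_2": 0.30, "case_3": 0.95}
controls = {"ctrl_1": 0.79, "ctrl_2": 0.31, "ctrl_3": 0.50}
print(greedy_caliper_match(cases, controls))  # [('case_1', 'ctrl_1'), ('case_2', 'ctrl_2')]
```

      Here case_3 stays unmatched because no control lies within the caliper; discarding such cases is the usual price of the caliper, traded for better covariate balance among the matched pairs.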

      We have also added a section to the “Discussion” of the revised manuscript (lines 856-873, page 41 in the revision) to discuss the potential impact of these limitations on the interpretation of our findings, as follows:

      “In this study, to ensure the accuracy of the LN status of the enrolled patients, more than 12 LN were dissected in all patients, including those who underwent surgical resection and those who underwent ESD. However, longer-term follow-up data (e.g., DFS and PFS) are not available, because of the limited sample-collection period and because the prognosis of such patients needs to be tracked over long periods of time; this may impact the strength of our conclusions. To address this limitation, we used propensity-score matching to reduce confounding biases in our training cohort. Patients were prospectively enrolled in our validation cohort (VC2), which was designed as a single-blinded prospective study to enhance the rigor and reliability of our findings. Furthermore, the presence of isolated tumor cells (ITCs) or micrometastases (MMs) within regional LN was not considered, because conventional histopathologic examination cannot detect them. According to previous studies, about 5% of pT1N0 gastric cancer patients have ITCs in LN, as do about 10% of pT1Nx CRC patients. The effect of MMs on prognosis in pT1N0 CRC is still unclear. The presence of ITCs/MMs in LN may explain why 29 LNM-negative patients (about 13% of all 221 patients) were classified into the high-risk group by the prediction model in our study. Our study provides a valuable database that could support clinical decision-making in the context of T1 CRC. We will continue to follow the prognosis of these patients, and the ITCs/MMs in LN also need to be further validated in future studies.”

      In conclusion, we appreciate the reviewer’s comments and acknowledge the limitations of our study. We believe that our study provides valuable insights into the development of a predictive model for LNM in T1 CRC, which could potentially aid in clinical decision-making according to the current NCCN guidelines.

      2) Test the classifier in public datasets:

      Following the reviewer’s suggestions, we tested our classifier in two different public datasets: the colon and rectal cancer study from CPTAC published in Nature (PMID: 25043054), and the metastatic colorectal cancer study published in Cancer Cell (PMID: 32888432). Details are given in “point-to-point responses R3 Q2”.

      3) Standard genetics and gene alterations:

      Following the reviewer’s suggestions, we assessed MSI status and CRC-associated gene mutations (RAS, BRAF and PIK3CA) in our cohort. Details are given in “point-to-point responses R3 Q1”.

      4) The influence of microenvironment:

      We apologize for not explaining this clearly. To determine whether the differences between the two groups (LNM+ and LNM-) arise from the tumor microenvironment or from the tumor tissue, we first used xCell (PMID: 29141660) to characterize the composition of the tumor microenvironment (Figure2-source data 4 in the revision). The overall xCell immune and microenvironment scores did not differ between the LNM-positive and -negative groups (P > 0.05, Wilcoxon rank-sum test) (Figure RL1A). However, when we compared the xCell-based cell deconvolution results between the LNM-positive and -negative groups, 8 microenvironment-associated cell features differed between the two groups (p < 0.05) (Figure RL1B). LNM-positive patients were characterized by chondrocytes and Th1 cells, whereas the remaining 6 features, including B cells, cDC and myocytes, were all higher in LNM-negative patients. Correspondingly, 7 immune cell markers were significantly different between the two groups (log2FC > 1 or < -1, P < 0.05, Wilcoxon rank-sum test) (Figure RL1C).

      Second, we checked the expression profiles of the signature proteins detected in our study in The Human Protein Atlas (HPA). Of the 9404 identified proteins, 7852 (83.4%) have CRC IHC staining data in HPA, and 6249 (79.6%) of these showed medium to high tumor-specific staining in CRC samples (Figure RL1D). Of the signature proteins up-regulated in LNM-positive patients (LNM+ vs. LNM-: log2FC > 1 and p < 0.05, Wilcoxon rank-sum test), 76 of 84 (90.5%) have IHC staining data in HPA, and 63 (82.9%) showed medium to high tumor-specific staining in CRC samples (Figure RL1E). Of the signature proteins up-regulated in LNM-negative patients (LNM+ vs. LNM-: log2FC < -1 and p < 0.05, Wilcoxon rank-sum test), 72 of 82 (87.8%) have IHC staining data in HPA, and 60 (83.3%) showed medium to high tumor-specific staining in CRC samples (Figure RL1F).

      Finally, we re-reviewed all H&E-stained slides of the tumor tissues of the patients in the study and added the tumor purity values of all patients’ tumor samples to Figure1-source data 1. Tumor purity did not differ significantly between LNM-positive (mean 87.75%) and LNM-negative patients (mean 88.27%) (P = 0.46, Student’s t-test), demonstrating the high purity and quality of the tumor tissues (Figure1-supplementary figure 1J in the revision).
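      The purity comparison uses Student’s (pooled-variance) two-sample t-test. A minimal pure-Python sketch of the test statistic and its degrees of freedom; the two-sided p-value then follows from the t distribution with those degrees of freedom (for example via `scipy.stats.t.sf` in SciPy, not called here). The purity values below are hypothetical, not the study’s data.

```python
import math
from statistics import mean, variance


def student_t(a, b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    # statistics.variance is the sample variance (n - 1 denominator).
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df


purity_positive = [0.85, 0.90, 0.88, 0.87]  # hypothetical LNM-positive tumor purity
purity_negative = [0.89, 0.86, 0.90, 0.88]  # hypothetical LNM-negative tumor purity
print(student_t(purity_positive, purity_negative))
```

      Student’s version assumes equal variances in the two groups; Welch’s t-test drops that assumption and is a common alternative when group variances differ.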

      These results indicate that, in our study, the differences between the LNM-positive and LNM-negative groups mainly arise from the tumor tissue. However, the tumor microenvironment may also play a critical, though indirect, role in T1 CRC development and progression.

      Figure RL1. A. Comparison of xCell immune and microenvironment scores between the LNM-negative group (n = 143) and LNM-positive group (n = 78). B&C. Immune/stromal signatures identified from xCell, together with derived relative abundance of immune and stromal cell types. D, E, F. Identified signature proteins (D), LNM-positive group up-regulated proteins (E) and LNM-negative group up-regulated proteins (F) were mostly validated by HPA IHC staining data. G. Barplot of tumor purity in LNM-negative and -positive patients.

      5) Contextualize the results:

      According to the reviewer’s advice, we have made corresponding adjustments in the revised manuscript, for example:

      • “We have made a comprehensive proteomic study of T1 CRC and provide a reliable data source for future research.” (line 342, page 17 in the revision)

      • “Here, we present a comprehensive proteomic study to focus on LNM in patients with submucosal T1 CRCs.” (line 788, page 37 in the revision)

      To address the concern that results were reported without context, we have added context for the results in the Results section of the revised manuscript, for example:

      • “Mucinous adenocarcinoma was considered to be a significant risk factor for LNM in T1 CRC (PMID: 31620912).” (line 498, page 24 in the revision)

      • “Mucinous adenocarcinoma of the colorectum is a lethal cancer with unknown molecular etiology and a high propensity for lymph node metastasis. Previous proteomic studies on mucinous adenocarcinoma have found proteins associated with treatment response in rectal mucinous adenocarcinoma and mechanisms of metastasis in mucinous salivary adenocarcinoma.” (PMID: 34990823, 28249646) (lines 534-538, page 26 in the revision)

      • “Previous studies have shown that PDPN expression correlated with LNM in numerous cancers, especially in early oral squamous cell carcinomas.” (PMID: 21105028) (line 570, page 27 in the revision)

      6) Multivariate analysis of RHOT2:

      RHOT2 and its paralog RHOT1 play an important role in mitochondrial trafficking (PMID: 16630562). Although the function of RHOT2 in cancer is still unknown, RHOT1 expression affects metastasis in a variety of tumors, including pancreatic cancer (PMID: 26101710), gastric cancer (PMID: 35170374) and small cell lung cancer (PMID: 33515563). In addition, previous studies have found that Myc regulation of mitochondrial trafficking through RHOT1 and RHOT2 enables tumor cell motility and metastasis (PMID: 31061095).

      As shown in Figure 4, in the analysis reported in the previous version we found that RHOT2 was significantly down-regulated (log2FC = -1.35; p = 0.003, Wilcoxon rank-sum test) in LNM-positive compared with LNM-negative patients in our T1 CRC cohort, and that low RHOT2 levels were associated with poorer overall survival of patients with colon cancer in the TCGA cohort. Knockdown of RHOT2 markedly enhanced the migration ability of colon cancer cells.

      To further explore the influence of RHOT2 on LNM in T1 CRC, we performed the following additional analyses (Figure 4 in the revision).

      First, we calculated the correlations between the expression of RHOT2 and that of other proteins in our cohort (Figure 4). A total of 1,508 proteins were significantly correlated with RHOT2 (P < 0.05, Spearman): 1,354 positively (coefficient > 0) and 154 negatively (coefficient < 0). However, when we performed GSEA on the RHOT2-associated proteins to identify biological signatures affected by RHOT2, most of the enriched pathways (p < 0.01) had a normalized enrichment score (NES) below 0, meaning that they were enriched mainly among proteins negatively correlated with RHOT2; only “mitochondrion” (GOCC) was enriched among positively correlated proteins (Figure 4). RHOT2 is known to be an important regulator of mitochondrial dynamics and mitophagy (PMID: 16630562), so this result suggests that RHOT2’s role in regulating mitochondrial function might contribute to the pathogenesis of metastasis, especially in early-stage CRC. Consistent with our previous results, the group of proteins negatively correlated with RHOT2 was significantly enriched for the EMT (HALLMARK) and complement and coagulation cascade pathways. Proteins up-regulated in the LNM-positive group (LNM+ vs. LNM-: log2FC > 0; p < 0.05, Wilcoxon rank-sum test) were negatively correlated with RHOT2 (p < 0.05, coefficient < 0, Spearman); these included CAP2, COL6A3, COL6A2, TNC, DPYSL3, PCOLCE and BGN in the EMT pathway, and GUCY1B3, VWF and F13A1 in the complement and coagulation cascade pathway (Figure 2E, L; Figure 4D in the revision). The ECM, focal adhesion and dilated cardiomyopathy (DCM) pathways were also enriched in the negatively correlated group; degradation of RHOT2 has been reported to be associated with DCM (PMID: 31455181). Overall, combined with our previous results, these findings suggest that RHOT2 may play an important role in LNM of T1 CRC (Figure 4D in the revision).
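      The correlation screen above (Spearman correlation of each protein with RHOT2, then splitting the significant proteins by the sign of the coefficient) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the p-value uses the large-sample approximation that, with no association, the Spearman coefficient multiplied by √(n − 1) is roughly standard normal; no multiple-testing correction is applied; GSEA is not reproduced; and `split_by_correlation` and the protein names in the usage note are hypothetical. Proteins with constant profiles must be filtered out first, because their correlation is undefined.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def split_by_correlation(target, proteins, alpha=0.05):
    """Split proteins significantly correlated with `target` by the sign of rho."""
    n = len(target)
    positive, negative = [], []
    for name, values in proteins.items():
        rho = spearman(target, values)
        z = rho * math.sqrt(n - 1)  # large-sample null approximation
        p = math.erfc(abs(z) / math.sqrt(2))
        if p < alpha:
            (positive if rho > 0 else negative).append(name)
    return positive, negative
```

      For instance, with 30 samples, a hypothetical protein that rises monotonically with RHOT2 lands in the positive list and one that falls monotonically lands in the negative list; the sizes of these two lists correspond to the 1,354 and 154 proteins reported above.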

      As the reviewer noted, the RHOT2 data are promising, but our understanding of its role is still preliminary. Further analyses and experiments will be needed to establish the specific role and mechanism of RHOT2 in tumor metastasis. We discuss these limitations of our study in the revision.

    1. Author Response

      Reviewer #1 (Public Review):

      The central claim that the R400Q mutation causes cardiomyopathy in humans require(s) additional support.

      We regret that the reviewer interpreted our conclusions as described. Because of the extreme rarity of the MFN2 R400Q mutation our clinical data are unavoidably limited and therefore insufficient to support a conclusion that it causes cardiomyopathy “in humans”. Importantly, this is a claim that we did not make and do not believe to be the case. Our data establish that the MFN2 R400Q mutation is sufficient to cause lethal cardiomyopathy in some mice (Q/Q400a; Figure 4) and predisposes to doxorubicin-induced cardiomyopathy in the survivors (Q/Q400n; new data, Figure 7). Based on the clinical association we propose that R400Q may act as a genetic risk modifier in human cardiomyopathy.

      To avoid further confusion, we modified the manuscript title to “A human mitofusin 2 mutation can cause mitophagic cardiomyopathy” and now provide a more detailed discussion of the implications and limitations of our study (page 11).

      First, the claim of an association between the R400Q variant (identified in three individuals) and cardiomyopathy has some limitations based on the data presented. The initial association is suggested by comparing the frequency of the mutation in three small cohorts to that in the large gnomAD database, which aggregates whole exome and whole genome data from many other studies including those from specific disease populations. Having a matched control population is critical in these association studies.

      We have added genotyping data from the matched non-affected control population (n=861) of the Cincinnati Heart study to our analyses (page 4). The conclusions did not change.

      For instance, according to gnomAD the MFN2 Q400P variant, while not observed in those of European ancestry, has a 10-fold higher frequency in the African/African American and South Asian populations (0.0004004 and 0.0003266, respectively). If the authors’ data in Table 1 are compared to the gnomAD African/African American population the p-value drops to 0.029262, which would not likely survive correction for multiple comparison (e.g., Bonferroni).

      Thank you for raising the important issue of racial differences in mutant allele prevalence and its association with cardiomyopathy. Sample size for this type of sub-group analysis is limited, but we are able to provide African-derived population allele frequency comparisons for both the gnomAD population and our own non-affected control group.

      As now described on page 4, and consistent with the gnomAD data, we did not observe MFN2 R400Q in any Caucasian individual, with or without cardiomyopathy. Its (heterozygous-only) prevalence among African Americans with cardiomyopathy is 3/674. Thus, the R400Q minor allele frequency of 3/1,345 in African American cardiomyopathy compares with 10/24,962 in African gnomAD, a statistically significant increase in this specific population group (p = 0.003308; χ² = 8.6293). Moreover, all African American non-affected controls in the case-control cohort were wild-type for MFN2 (0/452 minor alleles).
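      The allele-frequency comparison above, and the reviewer's point about correcting for multiple comparisons, can be illustrated with a Pearson chi-square test on a 2 × 2 table of minor/major allele counts followed by a Bonferroni adjustment. This sketch assumes an uncorrected (no Yates) chi-square with 1 degree of freedom; the response does not state the exact procedure. With the counts given in the text (3/1,345 vs. 10/24,962) it yields χ² ≈ 8.65 and p ≈ 0.003, close to but not identical with the reported values, so the authors' exact inputs or settings may differ. The number of tests passed to `bonferroni_adjust` (three) is a hypothetical choice, and both function names are illustrative.

```python
import math


def allele_chi2(minor_a, total_a, minor_b, total_b):
    """Pearson chi-square (1 df, no continuity correction) on a 2x2 allele-count table."""
    a, b = minor_a, total_a - minor_a  # group A: minor and major allele counts
    c, d = minor_b, total_b - minor_b  # group B: minor and major allele counts
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 df, the upper tail P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


def bonferroni_adjust(p_value, n_tests):
    """Bonferroni-adjusted p-value: multiply by the number of tests, cap at 1."""
    return min(1.0, p_value * n_tests)


# Minor alleles / total alleles, as given in the text.
chi2, p = allele_chi2(3, 1345, 10, 24962)

# Hypothetical family of three tests: the reviewer's p = 0.029262 no longer falls below 0.05.
reviewer_adjusted = bonferroni_adjust(0.029262, n_tests=3)
```

      Because the expected number of minor alleles in the cardiomyopathy group is below 1 for these counts, an exact test (for example `scipy.stats.fisher_exact`) is a common alternative to the chi-square approximation for tables this sparse.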

      (The source and characteristics of the subjects used by the authors in Table 1 are not clear from the methods.)

      The details of our study cohorts were inadvertently omitted during manuscript preparation. As now reported on pages 3 and 4, the Cincinnati Heart Study is a case-control study of 1,745 subjects with cardiomyopathy (1,117 Caucasian and 628 African American) and 861 non-affected controls (625 Caucasian and 236 African American) (Liggett et al Nat Med 2008; Matkovich et al JCI 2010; Cappola et al PNAS 2011). The Houston hypertrophic cardiomyopathy cohort (which had been screened by linkage analysis, candidate gene sequencing or clinical genetic testing) included 286 subjects (240 Caucasian and 46 African American) (Osio A et al Circ Res 2007; Li L et al Circ Res 2017).

      Relatedly, evaluation in a knock-in mouse model is offered as a way of bolstering the claim for an association with cardiomyopathy. Some caution should be offered here. Certain mutations that cause cardiomyopathy when knocked into mice have not been observed to cause it in humans carrying the same mutation. A recent example is the p.S59L variant in the mitochondrial protein CHCHD10, which causes cardiomyopathy in mice but not in humans (PMID: 30874923). While phenocopy is suggestive, there are differences between humans and mice, which makes the correlation imperfect.

      We understand that a mouse is not a man, and as noted above we view the in vitro data in multiple cell systems and the in vivo data in knock-in mice as supporting, not proving, the concept that MFN2 R400Q can be a genetic cardiomyopathy risk modifier. As indicated in the following responses, we have further strengthened the case by including results from 2 additional, previously undescribed human MFN2 mutation knock-in mice.

      Additionally, the argument that the Mfn2 R400Q variant causes a dominant cardiomyopathy in humans would be better supported by observing cardiomyopathy in the heterozygous Mfn2 R400Q mice and not just in the homozygous Mfn2 R400Q mice.

      We are intrigued that in the previous comment the reviewer warns that murine phenocopies are not 100% predictive of human disease, and in the next sentence he/she requests that we show that the gene dose-phenotype response is the same in mice and humans. We also note again that we never argued that MFN2 R400Q “causes a dominant cardiomyopathy in humans.” Nevertheless, we understand the underlying concerns, and in the revised manuscript we present data from new doxorubicin challenge experiments comparing cardiomyopathy development and myocardial mitophagy in WT, heterozygous, and surviving (Q/Q400n) homozygous Mfn2 R400Q KI mice (new Figure 7, panels E-G). Homozygous, but not heterozygous, R400Q mice exhibited an amplified cardiomyopathic response (greater LV dilatation, reduced LV ejection performance, exaggerated LV hypertrophy) and an impaired myocardial mitophagic response to doxorubicin. These in vivo data recapitulate new in vitro results in H9c2 rat cardiomyoblasts expressing MFN2 R400Q, which exhibited enhanced cytotoxicity (cell death and TUNEL labelling) in response to doxorubicin, associated with reduced reactive mitophagy (Parkin aggregation and mitolysosome formation) (new Figure 7, panels A-D). Thus, under the limited conditions we have explored to date, we do not observe cardiomyopathy development in heterozygous Mfn2 R400Q KI mice. However, we have expanded the association between R400Q, mitophagy and cardiomyopathy, thereby providing the desired additional support for our argument that it can be a cardiomyopathy risk modifier.

      Relatedly, it is not clear what the studies in the KI mouse prove over what was already known. Mfn2 function is known to be essential during the neonatal period and the authors have previously shown that the Mfn2 R400Q disrupts the ability of Mfn2 to mediate mitochondrial fusion, which is its core function. The results in the KI mouse seem consistent with those two observations, but it's not clear how they allow further conclusions to be drawn.

      We strenuously disagree with the underlying proposition of this comment, which is that “mitochondrial fusion (is the) core function” of mitofusins. We also believe that our previous work, alluded to but not specified, is mischaracterized.

      Our seminal study defining an essential role for Mfn2 for perinatal cardiac development (Gong et al Science 2015) reported that an engineered MFN2 mutation that was fully functional for mitochondrial fusion, but incapable of binding Parkin (MFN2 AA), caused perinatal cardiomyopathy when expressed as a transgene. By contrast, another engineered MFN2 mutant transgene that potently suppressed mitochondrial fusion, but constitutively bound Parkin (MFN2 EE) had no adverse effects on the heart.

      Our initial description of MFN2 R400Q and observation that it exhibited impaired fusogenicity (Eschenbacher et al PLoS One 2012) reported results of in vitro studies and transgene overexpression in Drosophila. Importantly, a role for MFN2 in mitophagy was unknown at that time and so was not explored.

      A major point both of this manuscript and of our work over the last decade on mitofusin proteins has been that their biological importance extends far beyond mitochondrial fusion. As introduced and discussed throughout our manuscript, MFN2 plays important roles in mitophagy and mitochondrial motility. Because this central point seems to have been overlooked, we have gone to great lengths in the revised manuscript to show unambiguously that impaired mitochondrial fusion is not the critical functional aspect that determines disease phenotypes caused by Mfn2 mutations. To accomplish this, we re-structured the experiments so that R400Q is compared at every level with two other natural MFN2 mutations linked to a human disease, the peripheral neuropathy CMT2A. These comparators are MFN2 T105M in the GTPase domain and MFN2 M376A/V in the same HR1 domain as MFN2 R400Q. Each of these human MFN2 mutations is fusion-impaired, but the current studies reveal that their spectrum of dysfunction differs in other ways, as summarized in Author response table 1:

      Author response table 1.

      We understand that it sounds counterintuitive for a mutation in a “mitofusin” protein to evoke cardiac disease independent of its appellative function, mitochondrial fusion. But the KI mouse data clearly relate the occurrence of cardiomyopathy in R400Q mice to the unique mitophagy defect provoked in vitro and in vivo by this mutation. We hope the reviewer will agree that the KI models provide fresh scientific insight.

      Additionally, the authors conclude that the effect of R400Q on the transcriptome and metabolome in a subset of animals cannot be explained by its effect on OXPHOS (based on the findings in Figure 4H). However, an alternative explanation is that R400Q is a loss-of-function variant but does not act in a dominant negative fashion. According to this view, mice homozygous for R400Q (which have no wildtype copies of Mfn2) lack Mfn2 function and consequently have an OXPHOS defect giving rise to the observed transcriptomic and metabolomic changes. But in the rat heart cell line with endogenous rat Mfn2, exogenous expression of MFN2 R400Q has no effect, as it is loss of function and is not dominant negative.

      Our results in the original submission, which are retained in Figures 1D and 1E and Figure 1 Figure Supplement 1 of the revision, exclude the possibility that R400Q is a functional null mutant for, but not a dominant suppressor of, mitochondrial fusion. We have added additional data for M376A in the revision, but the original results are retained in the main figure panels and a new supplemental figure:

      Figure 1D reports results of mitochondrial elongation studies (the morphological surrogate for mitochondrial fusion) performed in Mfn1/Mfn2 double knock-out (DKO) MEFs. The baseline mitochondrial aspect ratio in DKO cells infected with control (β-gal-containing) virus is ~2 (white bar), and increases to ~6 (i.e. ~normal) with forced expression of WT MFN2 (black bar). By contrast, the aspect ratio in DKO MEFs expressing MFN2 mutants T105M (green bar), M376A and R400Q (red bars in main figure), or R94Q and K109A (green bars in the supplemental figure) is only 3-4. For these results, the reviewer’s interpretation and ours agree: all of the MFN2 mutants studied are non-functional as mitochondrial fusion proteins.

      Importantly, Figure 1E (left panel) reports the results of parallel mitochondrial elongation studies performed in WT MEFs, i.e. in the presence of normal endogenous Mfn1 and Mfn2. Here, baseline mitochondrial aspect ratio is already normal (~6, white bar), and increases modestly to ~8 when WT MFN2 is expressed (black bar). By comparison, aspect ratio is reduced below baseline by expression of four of the five MFN2 mutants, including MFN2 R400Q (main figure and accompanying supplemental figure; green and red bars). Only MFN2 M376A failed to suppress mitochondrial fusion promoted by endogenous Mfns 1 and 2. Thus, MFN2 R400Q dominantly suppresses mitochondrial fusion. We have stressed this point in the text on page 5, first complete paragraph.
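      The inference in the two preceding paragraphs rests on comparing the same mutant in two genetic backgrounds: failure to restore elongation in DKO MEFs shows loss of fusion function, and a drop below baseline in WT MEFs shows dominant suppression. A minimal sketch of that decision logic, operating on condition means of mitochondrial aspect ratio (length / width), is below. It is illustrative only: the function names and any numbers passed to them are hypothetical, and real conclusions rest on statistical comparisons across cells rather than on a bare inequality between means.

```python
from statistics import mean


def mean_aspect_ratio(lengths, widths):
    """Mean mitochondrial aspect ratio (length / width) over measured mitochondria."""
    return mean(length / width for length, width in zip(lengths, widths))


def interpret_mutant(dko_wt, dko_mutant, wt_control, wt_mutant):
    """Classify a mutant from mean aspect ratios in two genetic backgrounds.

    dko_*: Mfn1/Mfn2 double-knockout MEFs (can the mutant drive fusion on its own?)
    wt_*:  wild-type MEFs (does the mutant suppress endogenous Mfn1/Mfn2?)
    """
    fusion_defective = dko_mutant < dko_wt          # fails to restore elongation in DKO cells
    dominant_suppressor = wt_mutant < wt_control    # pushes WT cells below their baseline
    if fusion_defective and dominant_suppressor:
        return "fusion-defective, dominant suppressor"
    if fusion_defective:
        return "fusion-defective, loss of function only"
    return "fusion-competent"
```

      For example, hypothetical means of 6 (WT MFN2) vs. 3.5 (mutant) in DKO MEFs and 6 (control virus) vs. 4.5 (mutant) in WT MEFs classify the mutant as a dominant suppressor, whereas a mutant that leaves the WT-MEF ratio at baseline is classed as loss of function only, the pattern described above for M376A.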

      Additionally, as the authors have shown MFN2 R400Q loses its ability to promote mitochondrial fusion, and this is the central function of MFN2, it is not clear why this can't be the explanation for the mouse phenotype rather than the mitophagy mechanism the authors propose.

      Please see our response #7 above beginning “We strenuously disagree...”

      Finally, it is asserted that the MFN2 R400Q variant disrupts Parkin activation, by interfering with MFN2 acting as a receptor for Parkin. The support for this in cell culture however is limited. Additionally, there is no assessment of mitophagy in the hearts of the KI mouse model.

      The reviewer may have overlooked the studies reported in original Figure 5, in which Parkin localization to cultured cardiomyoblast mitochondria is linked both to mitochondrial autophagy (LC3-mitochondria overlay) and to formation of mito-lysosomes (MitoQC staining). These results have been retained and expanded to include MFN2 M376A in Figure 6 B-E and Figure 6 Figure Supplement 1 of the revised manuscript. Additionally, selective impairment of Parkin recruitment to mitochondria was shown in mitofusin null MEFs in current Figure 3C and Figure 3 Figure Supplement 1, panels B and C.

      The in vitro and in vivo doxorubicin studies performed for the revision further strengthen the mechanistic link between cardiomyocyte toxicity, reduced Parkin recruitment and impaired mitophagy in MFN2 R400Q-expressing cardiac cells: MFN2 R400Q-amplified doxorubicin-induced H9c2 cell death is associated with reduced Parkin aggregation and mitolysosome formation in vitro, and the exaggerated doxorubicin-induced cardiomyopathic response in MFN2 Q/Q400 mice was associated with reduced cardiomyocyte mitophagy in vivo, measured with adenoviral Mito-QC (new Figure 7).

      Reviewer #2 (Public Review):

      In this manuscript, Franco et al show that the mitofusin 2 mutation MFN2 Q400 impairs mitochondrial fusion with normal GTPase activity. MFN2 Q400 fails to recruit Parkin and further disrupts Parkin-mediated mitophagy in cultured cardiac cells. They also generated MFN2 Q400 knock-in mice to show the development of lethal perinatal cardiomyopathy, which had an impairment in multiple metabolic pathways.

      The major strength of this manuscript is the in vitro study that provides a thorough understanding of the characteristics of the MFN2 Q400 mutant with respect to MFN2 function, and of its effect on mitochondrial function. However, the in vivo MFN2 Q/Q400 knock-in mice are more troubling given the split phenotype of MFN2 Q/Q400a vs MFN2 Q/Q400n subtypes. Their main findings towards impaired metabolism in mutant hearts fail to distinguish between the two subtypes.

      Thanks for the comments. We do not fully understand the statement that “impaired metabolism in mutant hearts fails to distinguish between the two (in vivo) subtypes.” The data in current Figure 5 and its accompanying figure supplements show that impaired metabolism measured both as metabolomic and transcriptomic changes in the subtypes (orange Q400n vs red Q400a in Figure 5 panels A and D) are reflected in the histopathological analyses. Moreover, newly presented data on ROS-modifying pathways (Figure 5C) suggest that a central difference between Mfn2 Q/Q400 hearts that can compensate for the underlying impairment in mitophagic quality control (Q400n) vs those that cannot (Q400a) is the capacity to manage downstream ROS effects of metabolic derangements and mitochondrial uncoupling. Additional support for this idea is provided in the newly performed doxorubicin challenge experiments (Figure 7), demonstrating that mitochondrial ROS levels are in fact increased at baseline in adult Q400n mice.

      While the data support the conclusion that MFN2 Q400 causes cardiomyopathy, several experiments are needed to further understand mechanism.

      We thank the reviewer for agreeing with our conclusion that MFN2 Q400 can cause cardiomyopathy, which was the major issue raised by R1. As detailed below we have performed a great deal of additional experimentation, including on two completely novel MFN2 mutant knock-in mouse models, to validate the underlying mechanism.

      This manuscript will likely impact the field of MFN2 mutation-related diseases and show how MFN2 mutation leads to perinatal cardiomyopathy in support of previous literature.

      Thank you again. We think our findings have relevance beyond the field of MFN2 mutant-related disease as they provide the first evidence (to our knowledge) that a naturally occurring primary defect in mitophagy can manifest as myocardial disease.

    1. Author Response:

      Evaluation Summary:

      This study investigates the mechanisms by which distributed systems control rhythmic movements of different speeds. The authors train an artificial recurrent neural network to produce the muscle activity patterns that monkeys generate when performing an arm cycling task at different speeds. The dominant patterns in the neural network do not directly reflect muscle activity and these dominant patterns do a better job than muscle activity at capturing key features of neural activity recorded from the monkey motor cortex in the same task. The manuscript is easy to read and the data and modelling are intriguing and well done.

      We thank the editor and reviewers for this accurate summary and for the kind words.

      Further work should better explain some of the neural network assumptions and how these assumptions relate to the treatment of the empirical data and its interpretation.

      The manuscript has been revised along these lines.

      Reviewer #1 (Public Review):

      In this manuscript, Saxena, Russo et al. study the principles through which networks of interacting elements control rhythmic movements of different speeds. Typically, changes in speed cannot be achieved by temporally compressing or extending a fixed pattern of muscle activation, but require a complex pattern of changes in amplitude, phase, and duty cycle across many muscles. The authors train an artificial recurrent neural network (RNN) to predict muscle activity measured in monkeys performing an arm cycling task at different speeds. The dominant patterns of activity in the network do not directly reflect muscle activity. Instead, these patterns are smooth, elliptical, and robust to noise, and they shift continuously with speed. The authors then ask whether neural population activity recorded in motor cortex during the cycling task closely resembles muscle activity, or instead captures key features of the low-dimensional RNN dynamics. Firing rates of individual cortical neurons are better predicted by RNN than by muscle activity, and at the population level, cortical activity recapitulates the structure observed in the RNN: smooth ellipses that shift continuously with speed. The authors conclude that this common dynamical structure observed in the RNN and motor cortex may reflect a general solution to the problem of adjusting the speed of a complex rhythmic pattern. This study provides a compelling use of artificial networks to generate a hypothesis on neural population dynamics, then tests the hypothesis using neurophysiological data and modern analysis methods. The experiments are of high quality, the results are explained clearly, the conclusions are justified by the data, and the discussion is nuanced and helpful. I have several suggestions for improving the manuscript, described below.

      This is a thorough and accurate summary, and we appreciate the kind comments.

      It would be useful for the authors to elaborate further on the implications of the study for motor cortical function. For example, do the authors interpret the results as evidence that motor cortex acts more like a central pattern generator - that is, a neural circuit that transforms constant input into rhythmic output - and less like a low-level controller in this task?

      This is a great question. We certainly suspect that motor cortex participates in all three key components: rhythm generation, pattern generation, and feedback control. The revised manuscript clarifies how the simulated networks perform both rhythm generation and muscle-pattern generation using different dimensions (see response to Essential Revisions 1a). Thus, the stacked-elliptical solution is consistent with a solution that performs both of these key functions.

      We are less able to experimentally probe the topic of feedback control (we did not deliver perturbations), but agree it is important. We have thus included new simulations in which networks receive (predictable) sensory feedback. These illustrate that the stacked-elliptical solution is certainly compatible with feedback impacting the dynamics. We also now discuss that the stacked-elliptical structure is likely compatible with the need for flexible responses to unpredictable perturbations / errors:

      "We did not attempt to simulate feedback control that takes into account unpredictable sensory inputs and produces appropriate corrections (Stavisky et al. 2017; Pruszynski and Scott 2012; Pruszynski et al. 2011; Pruszynski, Omrani, and Scott 2014). However, there is no conflict between the need for such control and the general form of the solution observed in both networks and cortex. Consider an arbitrary feedback control policy: $z = g_c(t, u_f)$, where $u_f$ is time-varying sensory input arriving in cortex and $z$ is a vector of outgoing commands. The networks we trained all embody special cases of the control policy where $u_f$ is either zero (most simulations) or predictable (Figure 9) and the particulars of $z$ vary with monkey and cycling direction. The stacked-elliptical structure was appropriate in all these cases. Stacked-elliptical structure would likely continue to be an appropriate scaffolding for control policies with greater realism, although this remains to be explored."

      The observation that cortical activity looks more like the pattern-generating modes in the RNN than the EMG seems to be consistent with this interpretation. On the other hand, speed-dependent shifts for motor cortical activity in walking cats (where the pattern generator survives the removal of cortex and is known to be spinal) seem qualitatively similar to the speed modulation reported here, at least at the level of single neurons (e.g., Armstrong & Drew, J. Physiol. 1984; Beloozerova & Sirota, J. Physiol. 1993). More generally, the authors may wish to contextualize their work within the broader literature on mammalian central pattern generators.

      We agree our discussion of this topic was thin. We have expanded the relevant section of the Discussion. Interestingly, Armstrong 1984 and Beloozerova 1993 both report quite modest changes in cortical activity with speed during locomotion (very modest in the case of Armstrong). The Foster et al. study agrees with those earlier studies, although the result is more implicit (things are stacked, but separation is quite small). Thus, there does seem to be an intriguing difference between what is observed in cortex during cycling (where cortex presumably participates heavily in rhythm/pattern generation) and during locomotion (where it likely does not, and concerns itself more with alterations of gait). This is now discussed:

      "Such considerations may explain why (Foster et al. 2014), studying cortical activity during locomotion at different speeds, observed stacked-elliptical structure with far less trajectory separation; the ‘stacking’ axis captured <1% of the population variance, which is unlikely to provide enough separation to minimize tangling. This agrees with the finding that speed-based modulation of motor cortex activity during locomotion is minimal (Armstrong and Drew 1984) or modest (Beloozerova and Sirota 1993). The difference between cycling and locomotion may reflect cortex playing a less-central role in the latter. Cortex is very active during locomotion, but that may reflect cortex being ‘informed’ of the spinally generated locomotor rhythm for the purpose of generating gait corrections if necessary (Drew and Marigold 2015; Beloozerova and Sirota 1993). If so, there would be no need for trajectories to be offset between speeds because they are input-driven, and need not display low tangling."
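      Trajectory tangling, the quantity invoked above, was defined by Russo et al. (2018) as $Q(t) = \max_{t'} \frac{\lVert \dot{x}(t) - \dot{x}(t') \rVert^2}{\lVert x(t) - x(t') \rVert^2 + \varepsilon}$; it is high when similar states have dissimilar derivatives. Below is a minimal pure-Python sketch for a single periodic trajectory. The central-difference derivative and the choice of $\varepsilon$ as 0.1 times the mean squared norm of the states are assumptions of this sketch, and `tangling` is a hypothetical helper, not code from the study.

```python
import math


def central_velocity(traj, dt):
    """Central-difference derivative of a closed (periodic) trajectory of state tuples."""
    n = len(traj)
    return [
        tuple((nxt - prv) / (2 * dt) for nxt, prv in zip(traj[(t + 1) % n], traj[(t - 1) % n]))
        for t in range(n)
    ]


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length state vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def tangling(traj, dt, eps_scale=0.1):
    """Tangling Q(t) for every sample of a periodic trajectory."""
    vel = central_velocity(traj, dt)
    # Assumed regulariser: a fraction of the mean squared norm of the states.
    eps = eps_scale * sum(sq_dist(x, [0.0] * len(x)) for x in traj) / len(traj)
    return [
        max(sq_dist(vel[t], vel[s]) / (sq_dist(traj[t], traj[s]) + eps) for s in range(len(traj)))
        for t in range(len(traj))
    ]


# Usage: one period of each shape, sampled at 100 points.
n_samples = 100
dt = 2 * math.pi / n_samples
circle = [(math.cos(k * dt), math.sin(k * dt)) for k in range(n_samples)]
figure_eight = [(math.sin(k * dt), math.sin(2 * k * dt)) for k in range(n_samples)]
peak_circle = max(tangling(circle, dt))
peak_figure_eight = max(tangling(figure_eight, dt))
```

      The circle, whose velocity differences scale with its state differences, keeps peak tangling below 1, whereas the figure-eight revisits the origin with a different velocity and so has a far larger peak; this is the sense in which elliptical trajectories offset between speeds can keep tangling low.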

      For instance, some conclusions of this study seem to parallel experimental work on the locomotor CPG, where a constant input (electrical or optogenetic stimulation of the MLR at a frequency well above the stepping rate) drives walking, and changes in this input smoothly modulate step frequency.

      We now mention this briefly when introducing the simulated networks and the modeling choices that we made:

      "Speed was instructed by the magnitude of a simple static input. This choice was made both for simplicity and by rough analogy to the locomotor system; spinal pattern generation can be modulated by constant inputs from supraspinal areas (Grillner, S. 1997). Of course, cycling is very unlike locomotion and little is known regarding the source or nature of the commanding inputs. We thus explore other possible input choices below."

      If the input to the RNN were rhythmic, the network dynamics would likely be qualitatively different. The use of a constant input is reasonable, but it would be useful for the authors to elaborate on this choice and its implications for network dynamics and control. For example, one might expect high tangling to present less of a problem for a periodically forced system than a time-invariant system. This issue is raised in line 210ff, but could be developed a bit further.

      To investigate, we trained many networks, each with a different weight initialization, to perform the same task but with a periodic forcing input. The stacked-elliptical solution often occurred, but other solutions were also common. The non-stacking solutions relied strongly on the ‘tilt’ strategy, where trajectories tilt into different dimensions as speed changes. There is of course nothing wrong with the ‘tilting’ strategy; it is a perfectly good way to keep tangling low. And of course it was also used (in addition to stacking) by both the empirical data and by graded-input networks (see section titled ‘Trajectories separate into different dimensions’). This is now described in the text (and shown in Figure 3 - figure supplement 2):

      "We also explored another plausible input type: simple rhythmic commands (two sinusoids in quadrature) to which networks had to phase-lock their output. Clear orderly stacking with speed was prominent in some networks but not others (Figure 3 - figure supplement 2a,b). A likely reason for the variability of solutions is that rhythmic-input-receiving networks had at least two “choices”. First, they could use the same stacked-elliptical solution, and simply phase-lock that solution to their inputs. Second, they could adopt solutions with less-prominent stacking (e.g., they could rely primarily on ‘tilting’ into new dimensions, a strategy we discuss further in a subsequent section)."

      This addition is clarifying: knowing that there are other reasonable solutions (e.g., pure tilt with little stacking) makes it more interesting that the stacked-elliptical solution was observed empirically. At the same time, the lesson to be drawn from the periodically forced networks isn’t entirely clear. They sometimes produced solutions with realistic stacking, so they are clearly compatible with the data. On the other hand, they didn’t do so consistently, which perhaps makes them a bit less appealing as a hypothesis. Potentially more appealing is the hypothesis that both input types (a static, graded input instructing speed and periodic inputs instructing phase) are used. We strongly suspect this could produce consistently realistic solutions. However, in the end we decided not to delve too far into this, because neither our data nor our models can strongly constrain the space of likely network inputs. This is noted in the Discussion:

      "The desirability of low tangling holds across a broad range of situations (Russo et al. 2018). Consistent with this, we observed stacked-elliptical structure in networks that received only static commands, and in many of the networks that received rhythmic forcing inputs. Thus, the empirical population response is consistent with motor cortex receiving a variety of possible input commands from higher motor areas: a graded speed-specifying command, phase-instructing rhythmic commands, or both."

      The use of a constant input should also be discussed in the context of cortical physiology, as motor cortex will receive rhythmic (e.g., sensory) input during the task. The argument that time-varying input to cortex will itself be driven by cortical output (475ff) is plausible, but the underlying assumption that cortex is the principal controller for this movement should be spelled out. Furthermore, this argument would suggest that the RNN dynamics might reflect, in part, the dynamics of the arm itself, in addition to those of the brain regions discussed in line 462ff. This could be unpacked a bit in the Discussion.


      We agree this is an important topic and worthy of greater discussion. We have also added simulations that directly address this topic. These are shown in the new Figure 9 and described in the new section ‘Generality of the network solution’:

      "Given that stacked-elliptical structure can instantiate a wide variety of input-output relationships, a reasonable question is whether networks continue to adopt the stacked-elliptical solution if, like motor cortex, they receive continuously evolving sensory feedback. We found that they did. Networks exhibited the stacked-elliptical structure for a variety of forms of feedback (Figure 9b,c, top rows), consistent with prior results (Sussillo et al. 2015). This relates to the observation that “expected” sensory feedback (i.e., feedback that is consistent across trials) simply becomes part of the overall network dynamics (M. G. Perich et al. 2020). Network solutions remained realistic so long as feedback was not so strong that it dominated network activity. If feedback was too strong (Figure 9b,c, bottom rows), network activity effectively became a representation of sensory variables and was no longer realistic."

      We agree that the observed dynamics may “reflect, in part, the dynamics of the arm itself, in addition to those of the brain regions discussed”, as the reviewer says. At the same time, it seems to us quite unlikely that they primarily reflect the dynamics of the arm. We have added the following to the Discussion to outline what we think is most likely:

      "This second observation highlights an important subtlety. The dynamics shaping motor cortex population trajectories are widely presumed to reflect multiple forms of recurrence (Churchland et al. 2012): intracortical, multi-area (Middleton and Strick 2000; Wang et al. 2018; Guo et al. 2017; Sauerbrei et al. 2020) and sensory reafference (Lillicrap and Scott 2013; Pruszynski and Scott 2012). Both conceptually (M. G. Perich et al. 2020) and in network models (Sussillo et al. 2015), predictable sensory feedback becomes one component supporting the overall dynamics. Taken to an extreme, this might suggest that sensory feedback is the primary source of dynamics. Perhaps what appear to be “neural dynamics” merely reflect incoming sensory feedback mixed with outgoing commands. A purely feedforward network could convert the former into the latter, and might appear to have rich dynamics simply because the arm does (Kalidindi et al. 2021). While plausible, this hypothesis strikes us as unlikely. It requires sensory feedback, on its own, to create low-tangled solutions across a broad range of tasks. Yet there exists no established property of sensory signals that can be counted on to do so. If anything, the opposite is true: trajectory tangling during cycling is relatively high in somatosensory cortex even at a single speed (Russo et al. 2018). The hypothesis of purely sensory-feedback-based dynamics is also unlikely because population dynamics begin unfolding well before movement begins (Churchland et al. 2012). To us, the most likely possibility is that internal neural recurrence (intra- and inter-area) is adjusted during learning to ensure that the overall dynamics (which will incorporate sensory feedback) provide good low-tangled solutions for each task. This would mirror what we observed in networks: sensory feedback influenced dynamics but did not create its dominant structure. Instead, the stacked-elliptical solution emerged because it was a ‘good’ solution that optimization found by shaping recurrent connectivity."

      As the reviewer says, our interpretation does indeed assume M1 is central to movement control. But of course this needn’t (and probably doesn’t) imply dynamics are only due to intra-M1 recurrence. What is necessarily assumed by our perspective is that M1 is central enough that most of the key signals are reflected there. If that is true, tangling should be low in M1. To clarify this reasoning, we have restructured the section of the Discussion that begins with ‘Even when low tangling is desirable’.

      The low tangling in the dominant dimensions of the RNN is interpreted as a signature of robust pattern generation in these dimensions (lines 207ff, 291). Presumably, dimensions related to muscle activity have higher tangling. If these muscle-related dimensions transform the smooth, rhythmic pattern into muscle activity, but are not involved in the generation of this smooth pattern, one might expect that recurrent dynamics are weaker in these muscle-related dimensions than in the first three principal components. That is, changes along the dominant, pattern-generating dimensions might have a strong influence on muscle-related dimensions, while changes along muscle-related dimensions have little impact on the dominant dimensions. Is this the case?


      A great question, and indeed it is the case. We have added perturbation analyses of the model showing this (Figure 3f). The results are very clear and exactly as the reviewer intuited.

      It would be useful to have more information on the global dynamics of the RNN; from the figures, it is difficult to determine the flow in principal component space far from the limit cycle. In Fig. 3E (right), perturbations are small (around half the distance to the limit cycle for the next speed); if the speed is set to eight, would trajectories initialized near the bottom of the panel converge to the red limit cycle? Visualization of the vector field on a grid covering the full plotting region in Fig. 3D-E with different speeds in different subpanels would provide a strong intuition for the global dynamics and how they change with speed.


      We agree that both panels in Figure 3e were hard to visually parse. We have improved it, but fundamentally it is a two-dimensional projection of a flow-field that exists in many dimensions. It is thus inevitable that it is hard to follow the details of the flow-field, and we accept that. What is clear is that the system is stable: none of the perturbations cause the population state to depart in some odd direction, or fall into some other attractor or limit cycle. This is the main point of this panel and the text has been revised to clarify this point:

      "When the network state was initialized off a cycle, the network trajectory converged to that cycle. For example, in Figure 3e (left) perturbations never caused the trajectory to depart in some new direction or fall into some other limit cycle; each blue trajectory traces the return to the stable limit cycle (black).

      Network input determined which limit cycle was stable (Figure 3e, right)."

      One could of course try to determine more about the flow-fields local to the trajectories. E.g., how quickly do they return activity to the stable orbit? We now explore some aspects of this in the new Figure 3f, which gets at a property that is fundamental to the elliptical solution. At the same time, we stress that some other details will be network-specific. For example, networks trained in the presence of noise will likely have a stronger ‘pull’ back to the canonical trajectory. We wish to avoid most of these details so that we can concentrate on features of the solution that 1) were preserved across networks and 2) could be compared with data.

      What was the goodness-of-fit of the RNN model for individual muscles, and how was the mean-squared error for the EMG principal components normalized (line 138)? It would be useful to see predicted muscle activity in a similar format as the observed activity (Fig. 2D-F), ideally over two or three consecutive movement cycles.

      The revision clarifies that the normalization is simply the standard one used when computing R^2 (normalization by total variance). We have improved this paragraph:

      "Success was defined as <0.01 normalized mean-squared error between outputs and targets (i.e., an R^2 > 0.99). Because 6 PCs captured ~95% of the total variance in the muscle population (94.6 and 94.8% for monkey C and D), linear readouts of network activity yielded the activity of all recorded muscles with high fidelity."
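      To make the correspondence explicit: dividing the squared error by the total variance of the targets gives a quantity equal to 1 - R^2, so a normalized MSE below 0.01 is the same criterion as R^2 above 0.99. The sketch below assumes errors and variances are pooled over all output channels, with each channel's mean removed; the manuscript's exact pooling is not specified here, and the targets are synthetic.

```python
import numpy as np


def normalized_mse(outputs: np.ndarray, targets: np.ndarray) -> float:
    """Sum of squared errors divided by the total variance of the targets.

    Arrays have shape (time, channels); each target channel's mean is removed
    when computing total variance, and both sums are pooled over channels.
    """
    residual = np.sum((outputs - targets) ** 2)
    total = np.sum((targets - targets.mean(axis=0)) ** 2)
    return float(residual / total)


def r_squared(outputs: np.ndarray, targets: np.ndarray) -> float:
    """Coefficient of determination under the same pooling."""
    return 1.0 - normalized_mse(outputs, targets)


# Usage with synthetic two-channel "muscle" targets.
t = np.linspace(0, 2 * np.pi, 1000, endpoint=False)
targets = np.column_stack([np.sin(t), np.cos(t)])
outputs = 0.99 * targets                  # 1% amplitude error
print(normalized_mse(outputs, targets))   # about 1e-4, well under 0.01
```

      Predicting each channel's mean gives a normalized MSE of 1 (R^2 = 0), which is why total variance is the natural scale for the 0.01 threshold.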

      Given this accuracy, plotting network outputs would be redundant with plotting muscle activity, as they would look nearly identical (and small differences would of course be different for every network).

      A related issue is whether the solutions are periodic for each individual node in the 50-dimensional network at each speed (as is the case for the first few RNN principal components and activity in individual cortical neurons and the muscles). If so, this would seem to guarantee that muscle decoding performance does not degrade over many movement cycles. Some additional plots or analysis might be helpful on this point: for example, a heatmap of all dimensions of v(t) for several consecutive cycles at the same speed, and recurrence plots for all nodes. Finally, does the period of the limit cycle in the dominant dimensions match the corresponding movement duration for each speed?


      These are good questions; it is indeed possible to obtain ‘degenerate’ non-periodic solutions if one is not careful during training. For example, if you always ask for 3 cycles during training, it becomes possible for the network to produce a periodic output based on non-periodic internal activity. To ensure this did not happen, we trained networks with a variable number of cycles. Inspection confirmed this was successful: all neurons (and the ellipse that summarizes their activity) showed periodic activity. These points are now made in the text:

      "Networks were trained across many simulated “trials”, each of which had an unpredictable number of cycles. This discouraged non-periodic solutions, which would be likely if the number of cycles were fixed and small.

      Elliptical network trajectories formed stable limit cycles with a period matching that of the muscle activity at each speed."
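      The training scheme described above can be sketched as a trial generator that tiles one cycle of the target pattern an unpredictable number of times. The cycle-count range, function name, and toy target are hypothetical; this illustrates the idea rather than reproducing the training code.

```python
import numpy as np


def make_trial(one_cycle: np.ndarray, rng: np.random.Generator,
               min_cycles: int = 2, max_cycles: int = 8):
    """Return (n_cycles, target) for one simulated training trial.

    one_cycle has shape (steps_per_cycle, n_muscles). The cycle count is
    drawn independently for every trial, so the trial length is unpredictable,
    discouraging solutions that are only periodic over a fixed, short window.
    """
    n_cycles = int(rng.integers(min_cycles, max_cycles + 1))  # inclusive range
    target = np.tile(one_cycle, (n_cycles, 1))                 # stack cycles in time
    return n_cycles, target


# Usage: a toy two-muscle cycle of 100 time steps.
phase = np.linspace(0, 2 * np.pi, 100, endpoint=False)
one_cycle = np.column_stack([np.sin(phase), np.cos(phase)])
rng = np.random.default_rng(0)
n_cycles, target = make_trial(one_cycle, rng)
```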

      We also revised the relevant section of the Methods to clarify how we avoided degenerate solutions, see section beginning with:

      “One concern, during training, is that networks may learn overly specific solutions if the number of cycles is small and stereotyped”.

      How does the network respond to continuous changes in input, particularly near zero? If a constant input of 0 is followed by a slowly ramping input from 0-1, does the solution look like a spring, as might be expected based on the individual solutions for each speed? Ramping inputs are mentioned in the Results (line 226) and Methods (line 805), but I was unable to find this in the figures. Does the network have a stable fixed point when the input is zero?


      For ramping inputs within the trained range, it is exactly as the reviewer suggests. The figure below shows a slowly ramping input (over many seconds) and the resulting network trajectory. That trajectory traces a spiral (black) that traverses the ‘static’ solutions (colored orbits).

      It is also true that activity returns to baseline levels when the input is turned off and network output ceases. For example, the input becomes zero at time zero in the plot below.

      The text now notes the stability when stopping:

      "When the input was returned to zero, the elliptical trajectory was no longer stable; the state returned close to baseline (not shown) and network output ceased."

      The text related to the ability to alter speed ‘on the fly’ has also been expanded:

      "Similarly, a ramping input produced trajectories that steadily shifted, and steadily increased in speed, as the input ramped (not shown). Thus, networks could adjust their speed anywhere within the trained range, and could even do so on the fly."

      The Discussion now notes that this ramping of speed results in a helical structure. The Discussion also now notes, informally, that we have observed this helical structure in motor cortex. However, we don’t want to delve into that topic further (e.g., with direct comparisons) as those are different data from a different animal, performing a somewhat different task (point-to-point cycling).

      As one might expect, network performance outside the trained range of speeds (e.g., when the input is between zero and that for the slowest trained speed) is likely to be unpredictable and network-specific. There is likely a ‘minimum speed’ below which networks can’t cycle. This also appeared to be true of the monkeys; below ~0.5 Hz their cycling became non-smooth and they tended to stop at the bottom. (This is why our minimum speed is 0.8 Hz.) However, it is very unclear whether there is any connection between these phenomena, and we thus avoid speculating.

      Why were separate networks trained for forward and backward rotations? Is it possible to train a network on movements in both directions with inputs of {-8, …, 8} representing angular velocity? If not, the authors should discuss this limitation and its implications.


      Yes, networks can readily be trained to perform movements in both directions, each at a range of speeds. This is now stated:

      "Each network was trained to produce muscle activity for one cycling direction. Networks could readily be trained to produce muscle activity for both cycling directions by providing separate forward- and backward-commanding inputs (each structured as in Figure 3a). This simply yielded separate solutions for forward and backward, each similar to that seen when training only that direction. For simplicity, and because all analyses of data involve within-direction comparisons, we thus consider networks trained to produce muscle activity for one direction at a time."

      As noted, networks simply found independent solutions for forward and backward. This is consistent with prior work where the angle between forward and backward trajectories in state space is sizable (Russo et al. 2018) and sometimes approaches orthogonality (Schroeder et al. 2022).

      It is somewhat difficult to assess the stability of the limit cycle and speed of convergence from the plots in Fig. 3E. A plot of the data in this figure as a time series, with sweeps from different initial conditions overlaid (and offset in time so trajectories are aligned once they're near the limit cycle), would aid visualization. Ideally, initial conditions much farther from the limit cycle (especially in the vertical direction) would be used, though this might require "cutting and pasting" the x-axis if convergence is slow. It might also be useful to know the eigenvalues of the linearized Poincaré map (choosing a specific phase of the movement) at the fixed point, if this is computationally feasible.

      See response to comment 4 above. The new Figure 3f now shows, as a time series, the return to the stable orbit after two types of perturbations. This specific analysis was suggested by the reviewer above, and we really like it because it gets at how the solution works. One could of course go further and try to ascertain other aspects of stability. However, we want to caution that this is a tricky and uncertain path. We found that the overall stacked-elliptical solution was remarkably consistent among networks (it was shown by all networks that received a graded speed-specifying input). The properties documented in Figure 3f are a consistent part of that solution. However, other detailed properties of the flow field likely won’t be. For example, some networks were trained in the presence of noise, and likely have a much more rapid return to the limit cycle. We thus want to avoid getting too much into those specifics, as we have no way to compare with data and determine which solution mimics that of the brain.

      Reviewer #2 (Public Review):

      The study from Saxena et al "Motor cortex activity across movement speeds is predicted by network-level strategies for generating muscle activity" expands on an exciting set of observations about neural population dynamics in monkey motor cortex during well-trained, cyclical arm movements. Their key findings are that as movement speed varies, population dynamics maintain detangled trajectories through stacked ellipses in state space. The neural observations resemble those generated by in silico RNNs trained to generate muscle activity patterns measured during the same cycling movements produced by the monkeys, suggesting a population mechanism for maintaining continuity of movement across speeds. The manuscript was a pleasure to read and the data convincing and intriguing. Below I note ideas on how the study could be improved by better articulating the assumptions behind interpretations, the defense of novelty, and the implications, noting that the study is already strong and will be of general interest.

      We thank the reviewer for the kind words and nice summary of our results.

      Primary concerns/suggestions:

      1 Novelty: Several of the observations seem to be incremental changes from previously published conclusions. First, that neural trajectories are detangled while muscle trajectories are tangled was a key conclusion of a previous study from Russo et al 2018. The current study emphasizes the same point with the minor addition of speed variance. A better argument for the novelty of the present conclusions is warranted. Second, the observations that motor cortical activity is heterogeneous are not new. That single neuronal activity in motor cortex is well accounted for in RNNs as opposed to muscle-like command patterns or kinematic tuning was a key conclusion of Sussillo et al 2015 and has been expanded upon by numerous other studies, but is also emphasized here seemingly as a new result. Again, the study would benefit from the authors more clearly delineating the novel aspects of the observations presented here.

      The extensive revisions of the manuscript included multiple large and small changes to address these points. The revisions help clarify that our goal is not to introduce a new framework or hypothesis, but to test an existing hypothesis and see whether it makes sense of the data. The key prior work includes not only Russo and Sussillo but also much of the recent work of Jazayeri, who found a similar stacked-elliptical solution in a very different (cognitive) context. We agree that if one fully digested Russo et al. 2018 and fully accepted its conclusions, then many (but certainly not all) of the present results are expected/predicted in their broad strokes. (Similarly, if one fully digested Sussillo et al. 2015, much of Russo et al. is expected in its broad strokes.) However, we see this as a virtue rather than a shortcoming. One really wants to take a conceptual framework and test its limits. And we know we will eventually find those limits, so it is important to see how much can be explained before we get there. This is also important because there have been recent arguments against the explanatory utility of network dynamics and the style of network modeling we use to generate predictions. It has been argued that cortical dynamics during reaching simply reflect sequence-like bursts, or arm dynamics conveyed via feedback, or kinematic variables that are derivatives of one another, or even randomly evolving data. We don’t want to engage in direct tests of all these competing hypotheses (some are more credible than others), but we do think it is very important to keep adding careful characterizations of cortical activity across a range of behaviors, as this constrains the set of plausible hypotheses. The present results are quite successful in that regard, especially given the consistency of network predictions.
Given the presence of competing conceptual frameworks, it is far from trivial that the empirical data are remarkably well-predicted and explained by the dynamical perspective. Indeed, even for some of the most straightforward predictions, we can’t help but remain impressed by their success. For example, in Figure 4 the elliptical shape of neural trajectories is remarkably stable even as the muscle trajectories take on a variety of shapes. This finding also relates to the ‘are kinematics represented’ debate. Jackson’s preview of Russo et al. 2018 correctly pointed out that the data were potentially compatible with a ‘position versus velocity’ code (he also wisely noted this is a rather unsatisfying and post hoc explanation). Observing neural activity across speeds reveals that the kinematic explanation isn’t just post hoc, it flat out doesn’t work. That hypothesis would predict large (~3-fold) changes in ellipse eccentricity, which we don’t observe. This is now noted briefly (while avoiding getting dragged too far into this rabbit hole):

      "Ellipse eccentricity changed modestly across speeds but there was no strong or systematic tendency to elongate at higher speeds (for comparison, a ~threefold elongation would be expected if one axis encoded cartesian velocity)."
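      The threefold figure follows from simple arithmetic. For circular motion of fixed radius r at angular frequency w, position along one axis is r cos(wt) while velocity along that axis is -r w sin(wt), so the velocity amplitude scales with w and the position amplitude does not. If one neural axis encoded position and another velocity with fixed gains (an assumption of this sketch), the ellipse's aspect ratio would scale with cycling frequency. The speeds used below (0.8 Hz and a hypothetical 2.4 Hz, i.e., a threefold range) are illustrative.

```python
import numpy as np


def axis_ratio(freq_hz: float, radius: float = 1.0, n_steps: int = 10_000) -> float:
    """Ratio of velocity-axis to position-axis amplitude for circular motion.

    Position x(t) = r cos(wt); velocity dx/dt = -r w sin(wt), so the velocity
    amplitude grows with angular frequency w = 2*pi*f while the position
    amplitude stays fixed at r.
    """
    w = 2 * np.pi * freq_hz
    t = np.linspace(0, 1 / freq_hz, n_steps, endpoint=False)  # exactly one cycle
    position = radius * np.cos(w * t)
    velocity = -radius * w * np.sin(w * t)
    return float(velocity.std() / position.std())


ratio_slow = axis_ratio(0.8)
ratio_fast = axis_ratio(2.4)
print(ratio_fast / ratio_slow)  # ~3.0: elongation tracks the speed ratio
```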

      Another result that was predicted, but certainly didn’t have to be true, was the continuity of solutions across speeds. Trajectories could have changed dramatically (e.g., tilted into completely different dimensions) as speed changed. Instead, the translation and tilt are large enough to keep tangling low, while still small enough that solutions are related across the ~3-fold range of speeds tested. While reasonable, this is not trivial; we have observed other situations where disjoint solutions are used (e.g., Trautmann et al. COSYNE 2022). We have added a paragraph on this topic:

      "Yet while the separation across individual-speed trajectories was sufficient to maintain low tangling, it was modest enough to allow solutions to remain related. For example, the top PCs defined during the fastest speed still captured considerable variance at the slowest speed, despite the roughly threefold difference in angular velocity. Network simulations (see above) show both that this is a reasonable strategy and also that it isn’t inevitable; for some types of inputs, solutions can switch to completely different dimensions even for somewhat similar speeds. The presence of modest tilting likely reflects a balance between tilting enough to alter the computation while still maintaining continuity of solutions."
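      The cross-speed comparison can be expressed as the fraction of one condition's variance that falls within the top PCs of another. The sketch below uses synthetic data (an ellipse whose plane is tilted by a hypothetical 30 degrees between "fast" and "slow"), not the recorded data; it shows that tilting reduces, but does not eliminate, the variance captured by the other condition's PCs.

```python
import numpy as np


def top_pcs(data: np.ndarray, k: int) -> np.ndarray:
    """Top-k principal axes (as orthonormal columns) of data shaped (time, neurons)."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T


def variance_captured(data: np.ndarray, axes: np.ndarray) -> float:
    """Fraction of data's total variance lying in the span of orthonormal axes."""
    centered = data - data.mean(axis=0)
    projected = centered @ axes
    return float(np.sum(projected ** 2) / np.sum(centered ** 2))


# Synthetic 20-neuron ellipses: the "slow" plane is tilted 30 degrees out of the "fast" plane.
rng = np.random.default_rng(1)
basis, _ = np.linalg.qr(rng.standard_normal((20, 3)))  # three orthonormal directions
t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
theta = np.pi / 6
fast = np.outer(np.cos(t), basis[:, 0]) + np.outer(np.sin(t), basis[:, 1])
tilted = np.cos(theta) * basis[:, 1] + np.sin(theta) * basis[:, 2]
slow = np.outer(np.cos(t), basis[:, 0]) + np.outer(np.sin(t), tilted)

print(variance_captured(slow, top_pcs(fast, 2)))  # about 0.875 = (1 + cos^2(30 deg)) / 2
```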

      As the reviewer notes, the strategy of simulating networks and comparing with data owes much to Sussillo et al. and other studies since then. At the same time, there are aspects of the present circumstances that allow greater predictive power. In Sussillo et al., there was already a set of well-characterized properties that needed explaining. And explaining those properties was challenging, because networks exhibited those properties only if properly regularized. In the present circumstance it is much easier to make predictions because all networks (or more precisely, all networks of our ‘original’ type) adopted an essentially identical solution. This is now highlighted better:

      "In principle, networks did not have to find this unified solution, but in practice training on eight speeds was sufficient to always produce it. This is not necessarily expected; e.g., in (Sussillo et al. 2015), solutions were realistic only when multiple regularization terms encouraged dynamical smoothness. In contrast, for the present task, the stacked-elliptical structure consistently emerged regardless of whether we applied implicit regularization by training with noise."

      It is also worth noting that Foster et al. (2014) actually found very minimal stacking during monkey locomotion at different speeds, and related findings exist in cats. This likely reflects where the relevant dynamics are most strongly reflected. The discussion of this has been expanded:

      "Such considerations may explain why (Foster et al. 2014), studying cortical activity during locomotion at different speeds, observed stacked-elliptical structure with far less trajectory separation; the ‘stacking’ axis captured <1% of the population variance, which is unlikely to provide enough separation to minimize tangling. This agrees with the finding that speed-based modulation of locomotion is minimal (Armstrong and Drew 1984) or modest (Beloozerova and Sirota 1993) in motor cortex. The difference between cycling and locomotion may be due to cortex playing a less-central role in the latter. Cortex is very active during locomotion, but that likely reflects cortex being ‘informed’ of the spinally generated locomotor rhythm for the purpose of generating gait corrections if necessary (Drew and Marigold 2015; Beloozerova and Sirota 1993). If so, there would be no need for trajectories to be offset between speeds because they are input-driven, and need not display low tangling."

      2 Technical constraints on conclusions: It would be nice for the authors to comment on whether the inherent differences in dimensionality between structures with single-cell resolution (the brain) and structures with only summed population-activity resolution (muscles) might contribute to the observed results of tangling in muscle state space and detangling in neural state space. Since whole-muscle EMG activity is a readout of higher-dimensional control signals in the motor neurons, are results influenced by the lack of dimensional resolution at the muscle level compared to the brain? Another way to put this might be: if the authors only had LFP data and motor neuron data, would the same effects be expected, and would they be observable? (Here I am assuming that dimensionality is approximately related to the number of recorded units × time units, and that the nature of the recorded units and signals differs vastly between neuronal populations (many neurons, spikes) and muscles (few muscles, compound electromyographic signals).) It would be impactful were the authors to address this potential confound by discussing it directly and speculating on whether detangling metrics in muscles might be higher if, rather than whole-muscle EMG, single motor unit recordings were made.

      We have added the following to the text to address the broad issue of whether there is a link between dimensionality and tangling:

      "Neural trajectory tangling was thus much lower than muscle trajectory tangling. This was true for every condition and both monkeys (paired, one-tailed t-test; p<0.001 for every comparison). This difference relates straightforwardly to the dominant structure visible in the top two PCs; the result is present when analyzing only those two PCs and remains similar when more PCs are considered (Figure 4 - figure supplement 1). We have previously shown that there is no straightforward relationship between high versus low trajectory tangling and high versus low dimensionality. Instead, whether tangling is low depends mostly on the structure of trajectories in the high-variance dimensions (the top PCs) as those account for most of the separation amongst neural states."
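      For reference, trajectory tangling (Russo et al. 2018) is Q(t) = max over t' of ||dx(t) - dx(t')||^2 / (||x(t) - x(t')||^2 + eps): it is high when similar states have dissimilar derivatives. A minimal sketch follows; derivatives are estimated by finite differences, and eps is set to a hypothetical fraction of the data's total variance rather than the value used in the paper. The toy comparison contrasts a circle with a figure-eight, whose self-crossing forces two nearly identical states to have very different derivatives.

```python
import numpy as np


def tangling(x: np.ndarray, dt: float, eps_scale: float = 0.1) -> np.ndarray:
    """Tangling Q(t) for a trajectory x shaped (time, dims), e.g. its top PCs.

    Q(t) = max_t' ||dx(t) - dx(t')||^2 / (||x(t) - x(t')||^2 + eps)
    """
    dx = np.gradient(x, dt, axis=0)  # finite-difference derivative
    state_d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    deriv_d2 = np.sum((dx[:, None, :] - dx[None, :, :]) ** 2, axis=-1)
    eps = eps_scale * np.sum(np.var(x, axis=0))  # hypothetical choice of eps
    return np.max(deriv_d2 / (state_d2 + eps), axis=1)


# Toy comparison over one 1 s cycle sampled at 200 points.
dt = 0.005
t = np.arange(200) * dt
w = 2 * np.pi
circle = np.column_stack([np.cos(w * t), np.sin(w * t)])
figure_eight = np.column_stack([np.sin(w * t), np.sin(2 * w * t)])

print(tangling(circle, dt).max(), tangling(figure_eight, dt).max())
```

      Under these settings the figure-eight's peak tangling is more than an order of magnitude above the circle's, illustrating why a self-crossing (or nearly self-crossing) trajectory in the top PCs is enough to make tangling high regardless of how many dimensions are analyzed.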

      As the reviewer notes, the data in the present study can’t yet address the more specific question of whether EMG tangling might be different at the level of single motor units. However, we have made extensive motor unit recordings in a different task (the pacman task). It remains true that neural trajectory tangling is much lower than muscle trajectory tangling. This is true even though the comparison is fully apples-to-apples (in both cases one is analyzing a population of spiking neurons). A manuscript is being prepared on this topic.

      3 Terminology and implications: A: what do the authors mean by a "muscle-like command". What would it look like and not look like? A rubric is necessary given the centrality of the idea to the study.

      We have completely removed this term from the manuscript (see above).

      B: if the network dynamics represent the controlled variables, why is it considered categorically different to think about control of dynamics vs control of the variables they control? That the dynamical systems perspective better accounts for the wide array of single neuronal activity patterns is supportive of the hypothesis that dynamics are controlling the variables, but not that they are unrelated. These ideas are raised in the introduction, around lines 39-43, in taking on the 'representational perspective', which could be more egalitarian toward different levels of representational codes (populations vs single neurons); they also relate to conclusions drawn later on. It is therefore interesting that the authors arrive at the conclusion at line 457: 'discriminating amongst models may require examining less-dominant features that are harder to visualize and quantify'. I would be curious to hear the authors expand a bit on this point, and on whether looping back to 'tuning' of neural trajectories (rather than single neurons) might offer a way out of the conundrum they describe. Clearly, using population activity and dynamical systems as a lens through which to understand cortical activity has been transformative, but I fail to see how the low-dimensional structure rules out representational (population trajectory) codes in higher dimensions.

      We agree. As Paul Cisek once wrote: the job of the motor system is to produce movement, not describe it. Yet to produce it, there must of course be signals within the network that represent the output. We have lightly rephrased a number of sentences in the Introduction to respect this point. We have also added the following text:

      "This ‘network-dynamics’ perspective seeks to explain activity in terms of the underlying computational mechanisms that generate outgoing commands. Based on observations in simulated networks, it is hypothesized that the dominant aspects of neural activity are shaped largely by the needs of the computation, with representational signals (e.g., outgoing commands) typically being small enough that few neurons show activity that mirrors network outputs. The network-dynamics perspective explains multiple response features that are difficult to account for from a purely representational perspective (Churchland et al. 2012; Sussillo et al. 2015; Russo et al. 2018; Michaels, Dann, and Scherberger 2016)."

      As requested, we have also expanded upon the point about it being fair to consider there to be representational codes in higher dimensions:

      "In our networks, each muscle has a corresponding network dimension where activity closely matches that muscle’s activity. These small output-encoding signals are ‘representational’ in the sense that they have a consistent relationship with a concrete decodable quantity. In contrast, the dominant stacked-elliptical structure exists to ensure a low-tangled scaffold and has no straightforward representational interpretation."

      4 Is there a deeper observation to be made about how the dynamics constrain behavior? The authors posit that the stacked elliptical neural trajectories may confer the ability to change speed fluidly, but this is not a scenario analyzed in the behavioral data. Given that the authors do not consider multi-paced single movements it would be nice to include speculation on what would happen if a movement changes cadence mid cycle, aside from just sliding up the spiral. Do initial conditions lead to predictions from the geometry about where within cycles speed may change the most fluidly or are there any constraints on behavior implied by the neural trajectories?

      These are good questions, but we don’t yet feel comfortable speculating too much. We have only lightly explored how our networks handle smoothly changing speeds. They do seem to mostly just ‘slide up the spiral’, as the reviewer says. However, we would also not be surprised if some moments within the cycle are more natural places to change cadence. We do have a bit of data that speaks to this: one of the monkeys in a different study (with a somewhat different task) naturally sped up over the course of a seven-cycle point-to-point cycling bout. The speeding-up appears continuous at the neural level; e.g., the trajectory was a spiral, just as one would predict. This is now briefly mentioned in the Discussion in the context of a comparison with SMA (as suggested by this reviewer, see below). However, we can’t really say much more than this, and we would definitely not want to rule out the hypothesis that speed might be more fluidly adjusted at certain points in the cycle.

      5 Could the authors comment more clearly on whether they think that state-space trajectories are representational and, if so, whether the single-neuron view of motor representation/control and the population view are diametrically opposed?

      See response to comment 3B above. In most situations the dynamical network perspective makes very different predictions from the traditional pure representational perspective. So in some ways the perspectives are opposed. Yet we agree that networks do contain representations – it is just that they usually aren’t the dominant signals. The text has been revised to make this point.

    1. Author Response:

      Reviewer #3 (Public Review):

      The Schepartz lab has previously shown that the binding of growth factors results in the formation of two distinct coiled coil dimers within the juxtamembrane (JM) segment. These two isomeric coiled coil structures are also allosterically favored by point mutations within the transmembrane (TM) helix. In this manuscript, the authors demonstrate that the JM coiled coil is a binary switch governing the trafficking of EGFR toward either the degradative or the recycling pathway.

      They design novel variants of EGFR (E661R and KRAA) that mimic the two distinct coiled coil types, EGF-type and TGF-α-type. These variants are further validated using the bipartite tetracysteine-ReAsH system. To assess the trafficking of these variants, the authors use confocal imaging to measure colocalization with the respective organelle markers. In addition, the authors use variants with point mutations in the TM segment that control the JM coiled coil state to demonstrate that trafficking depends on the JM segment and not on growth factor identity. EGFR signaling is of prime importance in cancer biology, and trafficking plays a major role: the degradative pathway decreases signaling, whereas the recycling pathway sustains it. The authors clearly demonstrate this switch in EGFR lifetime using relevant variants and show how well-known tyrosine kinase inhibitors regulate it in a drug-resistant non-small cell lung cancer model.

      The model proposed by the authors is mostly well supported by the data, but a few points require clarification.

      i) The authors need to address why the switch is incomplete when JM mutants are used but appears complete with TM mutants. A) Does this mean that recycling requires other criteria in addition to the JM segment? B) Is it possible that TM mutants cause other changes in addition to controlling the JM segment? C) Would it be better to use organelle transmembrane markers (Tf, Lamp1, NPC1, etc.)?

      The revised manuscript now includes a discussion of why the localization switch is less complete for the JM mutants than for the TM mutants. Whether these differences mean that the direction of trafficking requires direct interactions with the JM segment, or alternatively that the TM mutants cause other relevant changes in EGFR is currently under investigation.

      ii) It would be helpful to represent the data as distributions or scatter points instead of bar plots. Did the authors observe any dependence on expression level in their colocalization and lifetime assays?

      Figures 2 and 3 have been changed to illustrate both bars and individual points. We did not evaluate the effect of expression level on the extent of colocalization or EGFR lifetime.

      iii) Did the authors investigate the lifetime of the JM variants, as was shown for the TM variants in Fig 4?

    1. Author Response

      Reviewer #1 (Public Review):

      Dosil et al. have extensively analyzed NK cell-derived extracellular vesicles containing miRNAs. They analyzed the miRNAs in NK cell-derived EVs and found that specific types of miRNAs are contained in NK cell-derived EVs. Furthermore, they found that NK cell-derived EVs have immunomodulatory functions for T-cell response as well as for monocytes and moDCs. This paper is well designed and provides important information on NK cell-derived EVs. However, it is unclear whether NK cell-derived EVs are different from EVs derived from other immune cells such as T cells and B cells.

      We thank the reviewer for his/her comments and for pointing out this key point.

      1) The authors analyzed human NK cell-derived EVs. The repertoire of miRNAs in NK-EVs may differ among individuals. It would be better to show the degree of individual differences.

      We thank the reviewer for highlighting this point and agree that miRNA content in NK-EVs differs among individuals. We have now included a separate table where we show the relative abundance of EV-miRNAs in secreting activated NK cells and their secreted EVs from small RNA sequencing data, and the corresponding plots, including statistics (new Figure 1-figure supplement 2B,C). However, it is important to highlight that the enrichment of these miRNAs in NK-EVs compared to their parental cells is consistent within individuals, as shown in Figure 1- figure supplement 2 and Supplementary Table S1 where all individual data are shown.

      Furthermore, to address the reviewer's concern about whether NK-EV content differs from that of EVs from other cell types, we analyzed the average ratio of EVs vs secreting cells reported in a recent article (11) and found that the enrichment of specific miRNAs in NK-EVs is rather cell-specific and differs from that of unrelated cells such as white fat and hepatic cells, as shown in Figure Review 1 below.

      Figure Review 1. Parental cell and EV expression of NK-EV enriched miRNAs

      2) The authors analyzed the effect of NK-EVs on T cell response in Fig. 4. However, it is possible that EVs affect T cell responses in a nonspecific manner. It may be necessary to include control EVs.

      To address this key point raised by the reviewer, several new experiments were performed.

      First, small EVs from two distinct human cell lines (namely HEK-293 human embryonic kidney cells and Raji B lymphoblast cells) were isolated following the differential ultracentrifugation protocol described in the Methods section. These EVs had no impact on primary T cells isolated from healthy human donors, either on IFN-γ secretion (new Figure 3-figure supplement 3) or on activation measured by CD25 expression (Figure 4-figure supplement 2E,F); CD25 expression even decreased upon Raji B cell EV treatment under Th1-polarizing conditions.

      Also, three microRNAs that are preferentially excluded from the NK-EV fraction, namely hsa-miR-124, hsa-miR-3667 and hsa-miR-4158, were selected and loaded onto gold nanoparticles (new Figure 6-figure supplement 2), and their effects were evaluated in immunocompetent C57BL/6 mice after footpad injection. In contrast to what was observed for NK-EV-enriched microRNAs, these nanoparticles had no effect on either activation or IFN-γ secretion (new Figure 6H).

    1. Author Response:

      Reviewer #1:

      Lee et al. identify miR-20b as a molecular regulator of hepatic lipid metabolism through the post-transcriptional regulation of the nuclear receptor PPAR alpha. Through mechanistic studies, the authors identified the 3'UTR of PPARa as a direct target of miR-20b-mediated regulation. The experiments are well controlled, and the study provides deep mechanistic insight into the miR-20b/PPARa circuit in modulating hepatic lipid metabolism. Furthermore, the authors provide evidence supporting the targeting of the miR-20b pathway to enhance PPARa activation by the synthetic ligand fenofibrate. The studies provide much-needed mechanistic insight into molecular regulators of hepatic lipid metabolism in response to nutrient stress such as a high fat diet. While this is a detailed and thorough assessment of this pathway, several issues were identified in the review of this article, as outlined below:

      1) The authors state that there is no off-target expression of miR-20b in adipose tissue in their overexpression experiments. However, per Figure 4 supplement 1, EpiWAT has increased expression over controls in HFD-fed conditions. Furthermore, Figure 4 supplement 2 shows a functional difference in EpiWAT weight in HFD, where miR-20b-treated mice have higher fat weight. At the least, the authors need to discuss the potential role of adipose tissue in promoting their observed phenotype.

      This is a good point. We increased the number of samples and carefully analyzed the changes in both Mir20b expression and epididymal adipose tissue weight. We observed a slight increase in Mir20b expression in the epididymal adipose tissue of AAV-Mir20b HFD-fed mice compared with AAV-Control NCD-fed mice, but not compared with AAV-Control HFD-fed mice. The expression of Mir20b in adipose tissue did not differ significantly between AAV-Control HFD and AAV-Mir20b HFD mice (Figure 5-figure supplement 1).

      We have revised the text and added a discussion of the potential role of adipose tissue (pages 25-26, lines 582-594). Hepatic steatosis could be affected by adipose tissue through free fatty acid (FFA) release and hepatic uptake of circulating FFAs (Rasineni et al., 2021). Our results showed that the epididymal adipose tissue of HFD-fed mice was enlarged upon AAV-Mir20b treatment; however, the serum FFA levels in these mice were comparable to those in mice treated with AAV-Control (Figure 5-figure supplement 4B). Of note, the expression of genes related to lipolysis did not change in adipose tissue, and that of the hepatic FA transporter CD36 was decreased by AAV-Mir20b treatment (Figure 5Q and Figure 5-figure supplement 4A). In addition, excess hepatic triglycerides (TGs) are secreted as very low density lipoproteins (VLDLs), and the secretion rate increases with the TG level (Fabbrini et al., 2008). VLDLs deliver TGs from the liver to adipose tissue and contribute to its expansion (Chiba et al., 2003). Together, these reports suggest that adipose tissue is also remodeled by the liver in HFD-fed mice and non-alcoholic fatty liver disease (NAFLD) patients. Therefore, hepatic TG levels are unlikely to be affected by epididymal adipose tissue, and the increase in fat content (Figure 5-figure supplement 3) may be a consequence of increased hepatic TG levels.

      2) Figure 5 shows anti-miR-20b essentially restores PPARa expression. However, the rescue effects in terms of body weight, liver triglycerides and liver damage are only modestly improved. The authors need to discuss this modest effect and potentially offer alternative mechanisms aside from PPARa as the physiological target.

      Previously, we introduced AAV treatment after four weeks of high fat diet (HFD) feeding. Anti-Mir20b treatment significantly changed the expression of PPARA; however, its effect on the pathophysiological properties of the liver, although significant, was modest. We reasoned that this was because there was not enough time for the treatment to have a substantial impact on the liver. Thus, to maximize the effect of anti-Mir20b, the AAV was administered when the HFD was started. The new results showed more pronounced effects of anti-Mir20b (Figure 6).

      We also observed that other nuclear receptors, such as RORA, RORC, and THRB, could be potential targets of MIR20B (Figure 2H and Figure 2-figure supplement 3). However, in the patient data, there was no significant correlation between the expression of those nuclear receptors and that of MIR20B. In addition, among the candidate targets, only PPARA was selected as an overlapping predicted target of MIR20B by various miRNA target prediction programs, including miRDB, PicTar, TargetScan, and miRmap (Figure 2J, Figure 2-figure supplement 2). Consistent with these results, we observed that Ppara, not other nuclear receptors, is the target gene of MiR20b in both AAV-Mir20b and AAV-anti-Mir20b mice (Figure 5-figure supplement 2, Figure 6-figure supplement 2). Thus, we focused on PPARA as a MIR20B target in NAFLD.

      3) The authors performed experiments with a mutated 3'UTR of PPARa and show that mutated PPARa is refractory to regulation by miR-20b. However, the authors provide no functional evidence that mutating the 3'UTR of PPARa elicits changes in hepatic lipid metabolism. At a minimum, this point needs to be discussed.

      Thank you for your comment. To provide functional evidence, we tried to establish a PPARA 3’UTR mutation knock-in (KI) system in cells. However, we could not succeed because of technical difficulties and time constraints. Alternatively, we introduced the wild type PPARA open reading frame (ORF) followed by either the wild type (WT) or mutant (Mut) 3’UTR of PPARA in HepG2 cells, and analyzed the importance of the 3’UTR of PPARA. As shown in Figure 2-figure supplement 5C, MIR20B significantly suppressed the expression of PPARA and its target genes in PPARA-3’UTR WT-expressing cells. Furthermore, Oil Red O staining showed that MIR20B expression increased the intracellular lipid content in these cells (Figure 2-figure supplement 5B). However, MIR20B had no effect on either the expression of PPARA and its target genes or the intracellular lipid content in PPARA-3’UTR Mut-expressing cells (Figure 2-figure supplement 5C, D). We have added the new results on pages 17-18, lines 350-359, and in Figure 2-figure supplement 5.

      Reviewer #2:

      1) In the experiments depicted in Figures 1D and E, did OA treatment of HepG2 and/or Huh-7 cells produce a reduction in the levels of mRNA encoding PPARalpha (or PPARalpha protein levels) in concordance with the shown rise in mRNA for miR-20b?

      Thank you for your question. The samples used in Figure 1C–E were also analyzed to observe the changes in the expression of PPARA (Figure 2-figure supplement 4A–C). In each sample, the increase in MIR20B expression resulting from oleic acid (OA) treatment and HFD was accompanied by a reduction in the levels of PPARA mRNA.

      2) Moreover, Figure 1 shows a fuller landscape of the transcriptional impact of microRNAs in the context of obese livers in mice and humans. Given this, what made miR-20b more interesting than, for example, miR106a, miR-17, or others that also appear to be robustly regulated? Why focus on miR20b?

      This is a very good point. In the analysis of the regulatory network, other miRNAs, including MIR129 and MIR106A, also appeared to possibly regulate nuclear receptors in NAFLD. We further examined the relationship between candidate miRNAs and NAFLD progression in patient samples. As shown in the revised Figure 1B, the expression of MIR20B changed more robustly and significantly with NAFLD progression than that of MIR129 and MIR106A. This tendency was also confirmed in other experiments using OA-treated HepG2 and Huh7 cells or HFD-fed mice (Figure 1-figure supplement 4). Thus, we focused on the role of MIR20B in NAFLD. Nevertheless, we do not rule out the possibility that other miRNAs may be involved in NAFLD progression. Subsequent studies may uncover the roles of other miRNAs in liver physiology.

      3) What does the rank and p-value exactly represent in tabular part of Figure 1A? This is very unclear as shown, including the figure legend.

      The p-values in the table of Figure 1A were obtained from the hypergeometric distribution used to test the enrichment of downregulated nuclear receptors among the targets of a miRNA. In other words, they indicate the probability of observing at least the observed number of downregulated nuclear receptors among the miRNA targets if the targets were drawn at random. They were calculated by the following equation:

      $$p = P(X \ge O) = \sum_{i=O}^{\min(M,D)} \frac{\binom{M}{i}\binom{N-M}{D-i}}{\binom{N}{D}}$$

      where N is the total number of genes analyzed, M is the number of candidate target genes of the miRNA, D is the number of downregulated NR genes, and O is the observed overlap between the miRNA targets and the downregulated NR genes, as described in the Materials and Methods (page 9, lines 155-157). The ranks in the table were determined according to the p-value. The legend of Figure 1A has been modified as follows:

      “Figure 1. MIR20B expression is significantly increased in the livers of dietary and genetic obese mice and humans. (A) The miRNA regulatory networks for NR genes downregulated in the transcriptome of NAFLD patients. The adjusted p-values in the table represent the enrichment of miRNA targets in the downregulated NR genes (hypergeometric distribution).”
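      The enrichment test described in this reply can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the standard upper-tail hypergeometric form P(X ≥ O), with N, M, D, and O as defined above. The function name `hypergeom_enrichment_p` and the toy numbers are hypothetical. The multiple-testing correction behind the reported adjusted p-values is not specified here and is not shown.

```python
from math import comb


def hypergeom_enrichment_p(N: int, M: int, D: int, O: int) -> float:
    """Hypothetical helper: upper-tail hypergeometric p-value P(X >= O).

    N: total number of genes analyzed (population size)
    M: number of candidate target genes of the miRNA
    D: number of downregulated NR genes (number of draws)
    O: observed overlap between miRNA targets and downregulated NR genes
    """
    if not (0 <= M <= N and 0 <= D <= N and O >= 0):
        raise ValueError("require 0 <= M <= N, 0 <= D <= N and O >= 0")
    total = comb(N, D)  # number of ways to draw D genes from N
    upper = min(M, D)   # the overlap can never exceed M or D
    # Sum the probabilities of an overlap of O or more.
    # comb(n, k) returns 0 when k > n, so impossible terms contribute nothing.
    tail = sum(comb(M, i) * comb(N - M, D - i) for i in range(O, upper + 1))
    return tail / total
```

      For example, with a hypothetical background of N = 20 genes, M = 5 targets, D = 4 downregulated NRs and an observed overlap of O = 2, the function returns 1205/4845 ≈ 0.249. Ranking candidate miRNAs by this value, smallest first, would correspond to the ranking scheme described above.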

      4) Figure 1, supplement 1 shows the characteristics of the patients contributing data to Figure 1, etc. It shows that the normal patients are younger than the other two groups, the M-F ratio is not identical (more females in the normal group), and the total cholesterol levels are not well matched either. What other parameters are available? Hemoglobin A1c? Fasting glucose? In the end, we need to know that the groups, apart from the severity of NAFLD and NASH, were well matched. Given the small size of each group (n = only 4-5), this matching is critical to avoid confounding the relationship between miR-20b, PPARalpha, and NAFLD/NASH progression.

      Thank you for your comment. Accordingly, we have included the patient information in a table (Figure 1-figure supplement 1A, B). To increase the statistical power and prevent confounding effects, we increased the number of samples and tried to match age, weight, and male/female ratio between the groups. Due to the limited number of patient samples, the cohorts could not be perfectly matched. Nevertheless, there were no significant differences in age or male/female ratio among the three groups. Specifically, serum AST, ALT, and fasting glucose levels increased significantly with progression from normal to non-alcoholic steatohepatitis (NASH), but total cholesterol was comparable, as previously reported (Chung et al., 2020). We have revised the text on pages 7-8, lines 118-130.

      5) The title of Figure 2 relates to PPARalpha. However, in Figure 2G, it is clear that several NRs are downregulated by miR20b overexpression in cells. Although the paper focuses on PPARalpha, should the authors not explore at least some of the other hits to ensure that the impact of PPARalpha is of particular importance vs. others?

      This is a good point. We also observed that other nuclear receptors, such as RORA, RORC, and THRB, could be potential targets of MIR20B (Figure 2H and Figure 2-figure supplement 3). However, in the patient data, there was no significant correlation between the expression of those nuclear receptors and that of MIR20B. In addition, among the candidate targets, only PPARA was selected as an overlapping predicted target of MIR20B by various miRNA target prediction programs, including miRDB, PicTar, TargetScan, and miRmap (Figure 2J, Figure 2-figure supplement 2). Consistent with these results, we observed that PPARA, not other nuclear receptors, is the target gene of MIR20B in both AAV-Mir20b and AAV-anti-Mir20b mice (Figure 5-figure supplement 2, Figure 6-figure supplement 2). Thus, we focused on PPARA as a MIR20B target in NAFLD.

      6) In Figure 3, the data show, presumably, that OA induces miR20b, which then represses PPARalpha and, in turn, CD36 downstream of PPARalpha. If this is the case, then how does OA continue to get into the cells? Once CD36 expression falls dramatically, doesn't the key OA uptake mechanism fall with it? Then, does the induction of miR20b abate? Or, does FATP6 or another uptake mechanism account for OA entry into these cells?

      This is a good point. FA uptake was decreased by overexpression of MIR20B, accompanied by a considerable decrease in CD36 expression (Figure 4B, J). However, other lipid transporters such as FATPs were not significantly altered (Figure 4-figure supplement 5), suggesting that FA uptake continues through these transporters. The expression of CD36 is relatively low in normal hepatocytes, and it may not be the primary fatty acid transporter in these cells (Wilson et al., 2016). Furthermore, the decrease in FA uptake upon CD36 KO is modest even during a HFD (Wilson et al., 2016). In addition, we observed that MIR20B expression is induced by OA treatment and increases for up to 24 h, followed by a slight decrease to a constant elevated level (Figure 4-figure supplement 6). Together, these findings suggest that other fatty acid transporters account for the entry of OA into the cells. We have added this discussion on page 25, lines 571-581.

      7) Similarly, what happens to AGPAT, GPAT, and DGAT expression in the context of OA treatment and modulation of miR20b? Does the capacity of the cell to store OA as triglyceride inside lipid droplets change, so that the amount of free OA or oleoyl-CoA inside the cell rises? Could this impact the transcriptional phenotype?

      This is a very good point. Accordingly, we analyzed the transcriptional phenotype in the context of OA treatment and modulation of MIR20B. The expression of glycerolipid synthetic genes, including AGPATs, GPATs, and DGATs, was increased by OA treatment, but MIR20B overexpression did not influence the expression of these lipogenic genes, except for DGAT1. However, treatment with anti-MIR20B significantly reduced the expression of glycerolipid synthetic genes, including GPATs and DGATs, under OA treatment (Figure 4C, N). These results suggest that MIR20B is necessary but not sufficient to induce the expression of glycerolipid synthetic genes under OA treatment. We have shown that OA induces the expression of MIR20B (Figure 1C), which can explain why MIR20B overexpression did not produce an additional enhancement under OA treatment. The increase in DGAT1 expression induced by MIR20B might contribute to increased TG formation and capacity to store OA. This could shift the flux of oleoyl-CoA toward TG synthesis rather than β-oxidation, given the reduced expression of lipid oxidation-associated genes (Figure 4B). Thus, we expect that the decrease in OA uptake and the increase in TG formation induced by MIR20B reduce the amounts of OA or oleoyl-CoA inside the cell. However, because MIR20B also decreases lipid consumption through FA oxidation, free OA or oleoyl-CoA might be maintained at a stably increased level compared with the OA-untreated MIR NC or MIR20B condition. The impact of changes in OA or oleoyl-CoA levels on the transcriptional phenotype might therefore not be significant, consistent with the constant elevated level of MIR20B under OA treatment (Figure 4-figure supplement 6). We have added these results in Figure 4C and the Discussion (page 26, lines 595-610). Due to technical constraints, we could not measure the amounts of free OA and oleoyl-CoA.

      8) In Figure 3P, would the impact of anti-miR on the effect of OA on FASN be lost in PPARalpha KO cells? This would really test the functional relevance of the purported transcriptional hierarchy.

      Thank you for your valuable comment. Due to technical constraints, we tested the impact of anti-MIR20B treatment on OA-treated PPARA knock-down (KD) cells rather than KO cells. PPARA KD cells showed a significant decrease in PPARA expression. As shown in Figure 4-figure supplement 4I, anti-MIR20B treatment enhanced the expression of PPARA but did not have a significant effect on fatty acid synthase (FASN) expression in either control or PPARA KD cells. In addition, PPARA KD did not affect FASN expression. The expression patterns of PPARα target genes differ between mice and humans. FASN is regulated by PPARα in mice, but this is unclear in humans (Rakhshandehroo, Hooiveld, Muller, & Kersten, 2009; Rakhshandehroo, Knoch, Muller, & Kersten, 2010). Moreover, fenofibrate, a PPARα agonist, reduces the expression of FASN in methionine choline-deficient (MCD)-fed mice (Cui et al., 2021). Here, we used human HepG2 cells to investigate the effect of OA and MIR20B. It is plausible that FASN might not be regulated by PPARα in our system. We have added these results in Figure 4-figure supplement 4I.

      9) The authors should really at least perform a bulk RNAseq analysis to confirm the similarity of the effect of miR20b or anti-miR seen in cells, at the mouse or human liver tissue level. As it is, they only look at 3 FAOX genes, 2 FA uptake associated genes, and 2 FA synthesis genes. This is not very comprehensive as a validation of the in vitro data, although it is intriguing. Or, at the very least, look at a large validated set of PPARalpha target genes in vivo.

      Thank you for your comment. Accordingly, we selected PPARα target genes altered by MIR20B in OA-treated cells (Figure 4-figure supplement 1A, B), and then examined the hepatic expression of these PPARα target genes in HFD-fed mice treated with MIR20B or anti-MIR20B (Figure 5R and 6R). The expression of most PPARα target genes was decreased by OA treatment and the HFD, and MIR20B treatment further reduced their expression. In contrast, anti-MIR20B treatment rescued the reduced expression of PPARα target genes under OA treatment and the HFD. These results suggest that MIR20B suppresses PPARA in vivo, consistent with the results from cells. We have added these results in Figure 4-figure supplement 1A, B, Figure 5R, and Figure 6R.

      10) Notably, the figures in general do NOT show individual data points. This is the standard for visual display, rather than bar graphs with simple SEM bars.

      Thank you for your comment. We have revised the graphs to include individual data points.

      11) The in vivo data (e.g. Figure 4) have very low n values. Augmenting these would add confidence to the data. As an example of inconsistencies potentially stemming from very low n, the liver weights (Figure 4F) are not very different across groups, although the triglyceride levels in the livers (Figure 4H) are more than twice as high. The images of liver specimens shown as examples (Figure 4F) are also more dramatic than the weights would indicate. Note also that the body weights of the mice (Figure 4C) differ as well, and this alone could explain the livers being modestly heavier. Indeed, the extent of body weight excess mirrors the extent of liver weight excess, suggesting that the entire animal may be larger across multiple metabolic tissues, including adipose. This is supported by Figure 4D, where the fat mass looks to be larger as well. To this end, Figure 4 supplement 2 shows multiple tissue weights to be increased in this model, suggesting that specificity for hepatic steatosis may be low.

      Thank you for your comment. Accordingly, we conducted additional in vivo experiments with larger n values (n = 10). We then replaced the liver images with more representative ones. AAV-Mir20b robustly induced the hepatic expression of Mir20b and significantly increased the liver weight and hepatic TG levels (Figure 5F, 5I). In normal human liver, intrahepatic TGs do not exceed 5% of the liver weight (Fabbrini & Magkos, 2015). In our results, TG levels were increased more than threefold by the HFD, but the impact on liver weight was limited, as TGs did not account for more than 10% of the liver weight (Figure 5I). Excess hepatic TGs are secreted as very low density lipoproteins (VLDLs), and the secretion rate increases with the TG level (Fabbrini et al., 2008). VLDLs deliver TGs from the liver to adipose tissue and other metabolic tissues (Heeren & Scheja, 2021). The excess hepatic TGs induced by MiR20b were presumably transferred to epididymal adipose tissue, contributing to the increase in its weight, while inguinal and brown adipose tissues were not significantly affected by MiR20b (Figure 5-figure supplement 3). Together, the fat mass measured by EchoMRI included intrahepatic and adipose TGs and mirrored the increases shown in Figure 5D. In addition, MiR20b induced the expression of hepatic DGAT1, which could explain increased TG secretion through VLDLs (Figure 4C) (Alves-Bezerra & Cohen, 2017; Liang et al., 2004).

      Conversely, the supply of FFAs from adipose tissue might have contributed to hepatic steatosis. However, we observed that there were no significant changes in the expression of Mir20b and lipolytic genes in adipose tissue (Figure 5-figure supplement 4A). Furthermore, the serum FFA levels in the AAV-Control and AAV-Mir20b groups under the HFD were comparable (Figure 5-figure supplement 4B). These findings suggested that increased intrahepatic TG levels constituted the specific and primary effect of AAV-Mir20b.

      12) In Figure 5 S1, the anti-miR20b substantially reduces the weights of multiple tissues in mice fed a HFD; given this, why does overall body weight (Figure 5C) show such a modest difference? Figure 5E and F also suggest that the overall weights would have been lower than shown in Figure 5C. In the end, instead of bar graphs of the final weights, the entire weight curve for the mice fed the HFD should have been shown.

      Thank you for your comment. To make our results more robust, we increased the sample size (n = 10). Moreover, we provided the entire weight curve and revised the results (Figure 6C). AAV-anti-Mir20b treatment significantly reduced the liver weight (Figure 6F). The weight of adipose tissue, including epididymal white adipose tissue (EpiWAT), tended to decrease; however, the difference was not significant (Figure 6-figure supplement 3). As indicated in a previous question (#11), the change in hepatic TG levels could affect the weight of other tissues. In our revised Figure 6C, we show that the overall weight change might be higher than the sum of weight change of specific metabolic tissues, such as the liver and adipose tissues.

      13) How well were the NAFLD vs. normal GSE individuals matched? This is very important, since PPARalpha emerges from comparing these data sets. Matching is essential to make sure that the differences in NR expression do not stem from a confound that ran in parallel with the NAFLD cohort vs. the normal GSE cohort.

      This is a very good point. PPARA emerged from regulatory network analysis (Figure 1A) and was selected as a target of MIR20B through the analysis of RNA-seq data from MIR20B-overexpressing HepG2 cells (Figure 2). By constructing a regulatory network in NAFLD patients, we determined that MIR20B is responsible for NR regulation in NAFLD. As shown in Figure 1A, we analyzed the differential expression of NRs in NAFLD using public GSE data (GSE130970) consisting of patients with NAFLD and age- and weight-matched normal controls (Hoang et al., 2019). To verify the expression of MIR20B, the original manuscript assessed miRNA levels in another non-coding RNA GSE dataset (GSE40744) (previous Figure 1B). However, while reviewing the GSE40744 patients’ information with physicians, we found that some of the patients were virus-infected. We have therefore removed the GSE40744 data and sincerely apologize for the confusion.

      In the revised manuscript (page 16, lines 303-304), we examined the expression of MIR20B and other candidate miRNAs, such as MIR129 and MIR219A, in samples from patients at the Asan Medical Center (Seoul, Republic of Korea) who were diagnosed by pathologists and matched for age and weight. As shown in Figure 1B, MIR20B is one of the main miRNAs involved in NAFLD progression. In addition, PPARA expression was significantly negatively correlated with MIR20B expression (Figure 2-figure supplement 3).

      Reviewer #3:

      In this manuscript, Le et al. use an elegant combination of cultured cells, patient samples, and mouse models to show that miR-20b promotes non-alcoholic fatty liver disease (NAFLD) by suppressing PPAR-alpha. The authors show that miR-20b inhibits PPAR-alpha expression, resulting in reduced fatty acid oxidation, decreased mitochondrial biogenesis, and increased hepatocyte lipid accumulation both in vitro and in vivo. Inhibition of miR-20b in mouse NAFLD models leads to increased PPAR-alpha, reduced hepatic lipid accumulation, decreased inflammation, and improved glucose tolerance. Overall, the data are well-controlled and support the authors' conclusions.

      Strengths:

      1) In Figure 1, the authors show miR-20b is increased in NAFLD patients, mouse obesity/NAFLD models, and cultured liver cancer cells treated with oleic acid (OA). The use of multiple complementary approaches is very powerful, although more information regarding the diagnoses in the 13 patient samples would be helpful (see below).

      Thank you for your comment. Accordingly, we have included the patient information in a table (Figure 1-figure supplement 1A, B). To increase statistical power and prevent confounding effects, we increased the number of samples and tried to match the groups for age, weight, and male/female ratio. Due to the limited number of patient samples, the cohorts could not be perfectly matched. Nevertheless, there were no significant differences in age or male/female ratio among the three groups. Serum AST, ALT, and fasting glucose levels increased significantly with progression from normal to non-alcoholic steatohepatitis (NASH), whereas total cholesterol was comparable, as previously reported (Chung et al., 2020). We have revised the text on pages 7-8, lines 118-130.

      2) In Figure 2, the authors show that PPAR-alpha is a direct target of miR-20b. These data include a luciferase reporter assay regulated by the 3'UTR of PPAR-alpha. Importantly, when the 3'UTR is mutated, suppression of luciferase expression by miR-20b is no longer observed. The authors use multiple different algorithms to predict miR-20b targets, look for overlap, and then confirm PPAR-alpha as the most important "hit" in vitro.

      3) Figure 3 highlights changes in fatty acid metabolism in HepG2 cells transfected with miR-20b, miR-NC, or anti-miR-20b and treated with oleic acid. Figure 3, supplement 4 shows that anti-miR-20b can alleviate OA-induced hepatic steatosis in both HepG2 cells and primary hepatocytes. The use of another (primary) cell line here is important, because HepG2 is a liver cancer cell line, and metabolic changes in HepG2 cells might not be representative of non-neoplastic hepatocytes.

      4) In Figure 4, the authors show that miR-20b promotes hepatic steatosis, increases liver weight, increases liver injury markers, and impairs glucose tolerance and insulin sensitivity in HFD-fed mice. Conversely, anti-miR-20b inhibits hepatic steatosis, decreases liver weight and liver injury markers, and improves glucose tolerance and insulin sensitivity in HFD-fed mice (Figure 5). Anti-miR-20b also inhibits hepatic steatosis and fibrosis and decreases liver injury markers in MCD-fed mice (Figure 8). These in vivo studies provide excellent support for the authors' hypothesis regarding the role of miR-20b in promoting fatty liver disease. The liver readily takes up small nucleic acids, including miRs and anti-miRs. Thus, the possibility of using anti-miR-20b as a therapeutic for fatty liver disease is intriguing, and supported by these experiments.

      5) In Figure 6, in HepG2 cells, the authors demonstrate that PPAR-alpha overexpression (or to a lesser extent fenofibrate treatment) is able to rescue the transcriptional effects of miR-20b overexpression. Conversely, siPPAR-alpha can rescue the transcriptional effects of anti-miR-20b. Similar results are shown in Figure 7: fenofibrate is able to at least partially suppress some of the metabolic phenotypes that are exacerbated by miR-20b overexpression in HFD-fed mice (the decreased lean/BW ratio, elevated fasting glucose, some transcriptional changes). Again, it is nice to see that the in vitro data are supported by in vivo results.

      Thank you for your comments.

      Weaknesses:

      1) In Figure 3, figure supplement 2, it seems the effects of miR-20b overexpression in primary hepatocytes may be a bit overstated. While it does seem that miR-20b enhances the accumulation of fat in primary hepatocytes upon OA treatment, miR-20b overexpression alone does not seem to have significant effects on steatosis (A), cholesterol (B), or triglycerides (C).

      Thank you for your comment. We have revised the text as follows: “Unlike in HepG2 cells (Figure 2A-C), MIR20B alone did not induce lipid accumulation in primary hepatocytes without OA treatment, but MIR20B significantly increased lipid accumulation in the presence of OA (Figure 4-figure supplement 2)” (page 19, lines 383-385). We have also revised the title of Figure 4-figure supplement 2 to “MIR20B enhances lipid accumulation in primary hepatocytes under OA-treatment”.

      2) Histologic analysis of mouse liver samples by a pathologist is lacking. In Figure 4, is there increased inflammation and/or fibrosis with miR-20b overexpression, or just increased steatosis? In Figure 4 and Figure 8, it would be helpful if steatosis, fibrosis, and inflammation were quantified/scored histologically.

      Thank you for your comment. Accordingly, a pathologist performed histological analysis and assessed the NAFLD activity score (NAS) and fibrosis score. We have added the scoring graphs in Figures 5H, 6H, 7H, 8I, 8J, 9G, and 9H. In Figures 5G and 5H, AAV-Mir20b significantly increased steatosis under the HFD, but the increase in inflammation was not significant. Under the MCD diet, AAV-anti-Mir20b significantly decreased steatosis, inflammation, and fibrosis (Figure 8H-J). In addition, the combination of AAV-anti-Mir20b with fenofibrate significantly alleviated steatosis, inflammation, and fibrosis compared to AAV-Control under the MCD diet (Figure 9F-H).

      3) The effects of anti-miR-20b on hepatic triglycerides and inflammatory markers in vivo are modest (Figures 5 and 8). Perhaps an enhancement could be seen by combining anti-miR-20b with fenofibrate. While the authors show that fenofibrate's effects are suppressed with miR-20b overexpression, they don't examine what happens when fenofibrate is combined with anti-miR-20b. To me, this experiment is critical to determine if PPAR-alpha activity could be further maximized to combat NAFLD (beyond what is seen with fenofibrate alone).

      This is a very good point. Accordingly, we performed a new experiment in which fenofibrate was combined with anti-Mir20b to treat MCD-fed mice. The combination produced further improvements compared with fenofibrate treatment alone. The results are described on pages 23-24, lines 518-536.

      “Recently, drug development strategies for NAFLD/NASH have been moving toward combination therapies (Dufour, Caussy, & Loomba, 2020). However, the efficacy of current and candidate drugs against NAFLD/NASH, including fenofibrate, is limited (Fernandez-Miranda et al., 2008). Thus, we tested whether the combination of anti-Mir20b and fenofibrate would improve NAFLD in MCD-fed mice. Hepatic Mir20b levels were lower in MCD-fed mice administered AAV-anti-Mir20b than in those administered AAV-Control, and this reduction was also observed after fenofibrate treatment (Figure 9A). Interestingly, the combination of AAV-anti-Mir20b and fenofibrate increased PPARα levels to a greater extent than AAV-anti-Mir20b alone (Figure 9B, C). AAV-anti-Mir20b or fenofibrate administration significantly reduced liver weight and hepatic TG levels, and co-administration further reduced hepatic steatosis (Figure 9D, E). Histological sections showed that the combination of AAV-anti-Mir20b and fenofibrate improved NAFLD, as evidenced by reduced lipid accumulation and fibrosis in the liver (Figure 9F-H). Consistently, AST and ALT levels were significantly lower after combined treatment with AAV-anti-Mir20b and fenofibrate than after either single treatment (Figure 9I, J). In addition, the expression of genes related to hepatic inflammation, such as Tnf and Il6 (Figure 9K), and fibrosis, such as Acta2, Col1a1, Fn, and Timp1 (Figure 9L), was further decreased by the combination of AAV-anti-Mir20b and fenofibrate. These results suggest that AAV-anti-Mir20b may increase the efficacy of fenofibrate, especially its effect on fibrosis, and provide a more effective option for improving NAFLD/NASH.”

    1. Author Response

      Reviewer #1 (Public Review):

      This work introduces a novel framework for evaluating the performance of statistical methods that identify replay events. This is challenging because hippocampal replay is a latent cognitive process, where the ground truth is inaccessible, so methods cannot be evaluated against a known answer. The framework consists of two elements:

      1) A replay sequence p-value, obtained by comparing a sequence score (such as a Radon line fit, rank-order correlation, or weighted correlation) against shuffled permutations of the data. This element determines how trajectory-like the spiking representation is. The p-value threshold for accepting replay events is adjusted based on an empirical shuffled distribution to control for the false discovery rate.

      2) A trajectory discriminability score, also evaluated against shuffled permutations of the data. In this case, there are two different possible spatial environments that can be replayed, so the method compares the log odds of track 1 vs. track 2.
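      To make the second element concrete, a minimal Python sketch of a track log-odds score follows. It is an illustration under stated assumptions, not the authors' released code: it assumes the decoded posterior for each track is available as a list of time bins by position bins, normalised jointly across both tracks within each time bin, and that null log-odds values have already been computed from shuffled data by the caller. The function names track_log_odds and z_scored_log_odds are hypothetical.

```python
import math
from statistics import mean, stdev


def track_log_odds(posterior_t1, posterior_t2):
    """Log odds of track 1 vs track 2 for one candidate event.

    Each argument is a list of time bins, each a list of per-position-bin
    probabilities decoded for that track (assumed normalised jointly across
    both tracks within each time bin). Summing over time AND position
    discards ordering, so this scores track, not trajectory, identity.
    """
    total_t1 = sum(sum(time_bin) for time_bin in posterior_t1)
    total_t2 = sum(sum(time_bin) for time_bin in posterior_t2)
    return math.log(total_t1 / total_t2)


def z_scored_log_odds(observed, null_log_odds):
    """Z-score an event's log odds against log odds from shuffled data."""
    return (observed - mean(null_log_odds)) / stdev(null_log_odds)
```

      Because the sums run over both time and position, the score is insensitive to the order of decoded positions, which is also why it can be affected by which time bins fall inside the event boundaries.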

      The authors then use this framework (accepted number of replay events and trajectory discriminability) to study the performance of replay identification methods. They conclude that sharp wave ripple power is not a necessary criterion for identifying replay event candidates during awake run behavior if you have high multiunit activity, a higher number of permutations is better for identifying replay events, linear Bayesian decoding methods outperform rank-order correlation, and there is no evidence for pre-play.

      The authors tackle a difficult and important problem for those studying hippocampal replay (and indeed all latent cognitive processes in the brain) with spiking data: how do we understand how well our methods are doing when the ground truth is inaccessible? Additionally, systematically studying how the various methods for identifying replay perform is important for understanding the sometimes contradictory conclusions of replay papers. It helps consolidate the field around particular methods, leading to better reproducibility in the future. The authors' framework is also simple to implement and understand, and the code has been provided, making it accessible to other neuroscientists. Testing for track discriminability, as well as the sequentiality of the replay event, is a sensible additional data point to eliminate "spurious" replay events.

      However, there are some concerns with the framework as well. The novelty of the framework is questionable as it consists of a log odds measure previously used in two prior papers (Carey et al. 2019 and the authors' own Tirole & Huelin Gorriz, et al., 2022) and a multiple comparisons correction, albeit a unique empirical multiple comparisons correction based on shuffled data.

      With respect to the log odds measure itself, as presented, it is reliant on having only two options to test between, limiting its general applicability. Even in the data used for the paper, there are sometimes three tracks, which could influence the conclusions of the paper about the validity of replay methods. This also highlights a weakness of the method in that it assumes that the true model (spatial track environment) is present in the set of options being tested. Furthermore, the log odds measure itself is sensitive to the defined ripple or multiunit start and end times, because it marginalizes over both position and time, so any inclusion of place cells that fire for the animal's stationary position could influence the discriminability of the track. Multiple track representations during a candidate replay event would also limit track discriminability. Finally, the authors call this measure "trajectory discriminability", which seems a misnomer as the time and position information are integrated out, so there is no notion of trajectory.

      The authors also fail to make the connection with the control of the false discovery rate via false positives on empirical shuffles with existing multiple comparison corrections that control for false discovery rates (such as the Benjamini and Hochberg procedure or Storey's q-value). Additionally, the particular type of shuffle used will influence the empirically determined p-value, making the procedure dependent on the defined null distribution. Shuffling the data is also considerably more computationally intensive than the existing multiple comparison corrections.

      Overall, the authors make interesting conclusions with respect to hippocampal replay methods, but the utility of the method is limited in scope because of its reliance on having exactly two comparisons and having to specify the null distribution to control for the false discovery rate. This work will be of interest to electrophysiologists studying hippocampal replay in spiking data.
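      For reference, the Benjamini-Hochberg step-up procedure mentioned above can be sketched in a few lines of Python. This is the textbook procedure, not part of the authors' framework; the function name is hypothetical, and the procedure's false discovery rate guarantee assumes independent (or positively dependent) p-values, which is the assumption the authors argue replay data may violate.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per p-value: True if rejected at FDR level q."""
    m = len(p_values)
    # Indices of the p-values in ascending order of p.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= k * q / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    # Reject every hypothesis whose rank is at or below k_max.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected
```

      For example, with p-values 0.02, 0.03, and 0.04 all three are rejected at q = 0.05, even though 0.02 alone exceeds its own rank-1 threshold, because the step-up rule uses the largest qualifying rank.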

      We would like to thank the reviewer for the feedback.

      Firstly, we would like to clarify that it is not our intention to present this tool as a novel replay detection approach. It is indeed merely a novel tool for evaluating different replay detection methods. Also, while we previously used log odds metrics to quantify contextual discriminability within replay events (Tirole et al., 2021), this framework is novel in how it is used (to compare replay detection methods), and the use of empirically determined FPR-matched alpha levels. We have now modified the manuscript to make this point more explicit.

      Our use of the term trajectory-discriminability is now changed to track-discriminability in the revised manuscript, given we are summing over time and space, as correctly pointed out by the reviewer.

      While this approach requires two tracks in its current implementation, we have also been able to apply it to three tracks with a minor variation in the method; however, this is beyond the scope of our current manuscript. Prior experience on other tracks not analysed in the log odds calculation should not pose any issue, given that the animal likely replays many experiences of the day (e.g. the home cage). These “other” replay events likely contribute to candidate replay events that fail to reach a statistically significant replay score on either track.

      With regard to using a cell-id randomized dataset to empirically estimate false-positive rates, we have provided a detailed explanation behind our choice of using an alpha level correction in our response to the essential revisions above. This approach is not used to examine the effect of multiple comparisons, but rather to measure the replay detection error due to non-independence and a non-uniform p value distribution. Therefore we do not believe that existing multiple comparison corrections such as Benjamini and Hochberg procedure are applicable here (Author response image 1-3). Given the potential issues raised with a session-based cell-id randomization, we demonstrate above that the null distribution is sufficiently independent from the four shuffle-types used for replay detection (the same was not true for a place field randomized dataset) (Author response image 4).

      Author response image 1.

      Distribution of Spearman’s rank order correlation score and p value for false events with random sequence where each neuron fires one (left), two (middle) or three (right) spikes.

      Author response image 2.

      Distribution of Spearman’s rank order correlation score and p value for mixture of 20% true events and 80% false events where each neuron fires one (left), two (middle) or three (right) spikes.

      Author response image 3.

      Number of true events (blue) and false events (yellow) detected based on alpha level 0.05 (upper left), empirical false positive rate 5% (upper right) and false discovery rate 5% (lower left, based on BH method)

      Author response image 4.

      Proportion of false events detected when using dataset with within and cross experiment cell-id randomization and place field randomization. The detection was based on single shuffle including time bin permutation shuffle, spike train circular shift shuffle, place field circular shift shuffle, and place bin circular shift shuffle.

      Reviewer #2 (Public Review):

      This study proposes to evaluate and compare different replay methods in the absence of "ground truth" using data from hippocampal recordings of rodents that were exposed to two different tracks on the same day. The study proposes to leverage the potential of Bayesian methods to decode replay and reactivation in the same events. They find that events that pass a higher threshold for replay typically yield a higher measure of reactivation. On the other hand, events from the shuffled data that pass thresholds for replay typically don't show any reactivation. While well-intentioned, I think the result is highly problematic and poorly conceived.

      The work presents a lot of confusion about the nature of null hypothesis testing and the meaning of p-values. The prescription arrived at, to correct p-values by putting animals on two separate tracks and calculating a "sequence-less" measure of reactivation are impractical from an experimental point of view, and unsupportable from a statistical point of view. Much of the observations are presented as solutions for the field, but are in fact highly dependent on distinct features of the dataset at hand. The most interesting observation is that despite the existence of apparent sequences in the PRE-RUN data, no reactivation is detectable in those events, suggesting that in fact they represent spurious events. I would recommend the authors focus on this important observation and abandon the rest of the work, as it has the potential to further befuddle and promote poor statistical practices in the field.

      The major issue is that the manuscript conveys much confusion about the nature of hypothesis testing and the meaning of p-values. It's worth stating here the definition of a p-value: the conditional probability of rejecting the null hypothesis given that the null hypothesis is true. Unfortunately, in places, this study appears to confound the meaning of the p-value with the probability of rejecting the null hypothesis given that the null hypothesis is NOT true-i.e. in their recordings from awake replay on different mazes. Most of their analysis is based on the observation that events that have higher reactivation scores, as reflected in the mean log odds differences, have lower p-values resulting from their replay analyses. Shuffled data, in contrast, does not show any reactivation but can still show spurious replays depending on the shuffle procedure used to create the surrogate dataset. The authors suggest using this to test different practices in replay detection. However, another important point that seems lost in this study is that the surrogate dataset that is contrasted with the actual data depends very specifically on the null hypothesis that is being tested. That is to say, each different shuffle procedure is in fact testing a different null hypothesis. Unfortunately, most studies, including this one, are not very explicit about which null hypothesis is being tested with a given resampling method, but the p-value obtained is only meaningful insofar as the null that is being tested and related assumptions are clearly understood. From a statistical point of view, it makes no sense to adjust the p-value obtained by one shuffle procedure according to the p-value obtained by a different shuffle procedure, which is what this study inappropriately proposes. Other prescriptions offered by the study are highly dataset and method dependent and discuss minutiae of event detection, such as whether or not to require power in the ripple frequency band.

      We would like to thank the reviewer for their feedback. The purpose of this paper is to present a novel tool for evaluating replay sequence detection using an independent measure that does not depend on the sequence score. As the reviewer stated, in this study we detect replay events based on a set alpha threshold (0.05), i.e. the conditional probability of rejecting the null hypothesis given that the null hypothesis is true. All replay events detected during PRE, RUN, or POST are classified as track 1 or track 2 replay events by comparing each event’s sequence score to the shuffled distribution. The log odds measure is then applied only to track 1 and track 2 replay events selected using sequence-based detection. It is important to clarify that we never use log odds to select events for examining their sequenceness p value. Therefore, we disagree with the reviewer’s claim that, for awake replay events detected on different tracks, we are quantifying the probability of rejecting the null hypothesis given that the null hypothesis is not true.

      However, we fully understand the reviewer’s concerns about cell-id randomization and the potential caveats of using this approach to quantify the false positive rate. First, we would like to clarify that the purpose of the alpha level adjustment was to facilitate comparison across methods by finding, for each method, the alpha level at which the empirically determined false-positive rates match. Without this step, it is impossible to compare two methods that differ in strictness (e.g. to ask whether two different shuffles are needed rather than a single shuffle procedure). This means we are interested in comparing the performance of different methods at the equivalent alpha level where each method detects 5% spurious events per track, rather than at an arbitrary alpha level of 0.05 (which is difficult to interpret if statistical tests are run on non-independent samples). Once the false positive rate is matched, it is possible to compare two methods to see which one yields more events and/or has better track discriminability.
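      The FPR-matched alpha described above could be computed roughly as follows. This is a minimal sketch, assuming one replay p-value per candidate event (for a single track) from the cell-id randomized dataset and the detection rule p <= alpha; the function name fpr_matched_alpha is hypothetical and this is not the authors' implementation.

```python
def fpr_matched_alpha(null_p_values, target_fpr=0.05):
    """Largest alpha whose empirical false-positive rate on the randomized
    dataset does not exceed target_fpr, using the rule p <= alpha.

    Returns 0.0 if even the smallest null p-value already exceeds the
    target rate (i.e. no candidate threshold is strict enough).
    """
    if not null_p_values:
        raise ValueError("need at least one p-value from the randomized data")
    n = len(null_p_values)
    best = 0.0
    # Candidate thresholds are the observed null p-values, ascending; the
    # detected fraction can only grow, so we can stop at the first failure.
    for alpha in sorted(set(null_p_values)):
        detected = sum(p <= alpha for p in null_p_values)
        if detected / n <= target_fpr:
            best = alpha
        else:
            break
    return best
```

      On a uniform null the matched alpha recovers the nominal 0.05; if the randomized dataset yields an excess of small p-values, as can happen with non-independent samples, the returned alpha shrinks accordingly. Real events would then be called significant when their p-value is at or below the returned alpha.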

      We agree with the reviewer that the choice of data randomization is crucial. When the null distribution of a randomized dataset is very similar to the null distribution used for detection, this should lead to a 5% false positive rate (as a consequence of circular reasoning). In our response to the essential revisions, we have discussed the effect of data randomization on replay detection. We observed that while a place field circularly shifted dataset and a cell-id randomized dataset led to similar false-positive rates when shuffles that disrupt temporal information were used for detection, a place field circularly shifted dataset, but not a cell-id randomized dataset, was sensitive to shuffle methods that disrupted place information (Author response image 4). We would also like to highlight one of our findings from the manuscript: the discrepancy between different methods can be substantially reduced when the alpha level is adjusted to match false-positive rates (Figure 6B). This result directly supports the utility of a cell-id randomized dataset for finding the alpha level with equivalent false positive rates across methods. Hence, while imperfect, we argue that cell-id randomization remains an acceptable method, as it is more distinct from the four shuffles we used for replay detection than a place field randomized dataset is (Author response image 4).

      While the use of two linear tracks was crucial for our current framework to calculate log odds for evaluating replay detection, we acknowledge that it limits the applicability of this framework. At the same time, the conclusions of the manuscript with regard to ripples, replay methods, and preplay should remain valid on a single track. A second track just provides a useful control for how place cells can realistically remap within another environment. However, with modification, it may be applied to a maze with different arms or subregions, although this is beyond the scope of our current study.

      Last but not least, we partly agree with the reviewer that the results can be dataset-specific, such that they may vary depending on the animal’s behavioural state and the experimental design. However, our results highlight that there is a very wide distribution of both track discriminability and the proportion of significant events detected across methods currently used in the field. And while several methods appear comparable in their effectiveness for replay detection, other methods (some of which have previously been used in peer-reviewed publications) are deeply flawed if the alpha level is not sufficiently strict. Regardless of the method used, most methods can be corrected with an appropriate alpha level (e.g. using all spikes for a rank order correlation). Therefore, while the exact results may be dataset-specific, we feel that this is most likely due to the number of cells and the properties of the track rather than the use of two tracks. Reporting the empirically determined false-positive rate, and using for detection an alpha level matched to a target false-positive rate (such as 0.05), does not require a second track, and the adoption of this approach by other labs would help to improve the interpretability and generalizability of their replay data.

      Reviewer #3 (Public Review):

      This study tackles a major problem with replay detection, which is that different methods can produce vastly different results. It provides compelling evidence that the source of this inconsistency is that biological data often violates assumptions of independent samples. This results in false positive rates that can vary greatly with the precise statistical assumptions of the chosen replay measure, the detection parameters, and the dataset itself. To address this issue, the authors propose to empirically estimate the false positive rate and control for it by adjusting the significance threshold. Remarkably, this reconciles the differences in replay detection methods, as the results of all the replay methods tested converge quite well (see Figure 6B). This suggests that by controlling for the false positive rate, one can get an accurate estimate of replay with any of the standard methods.

      When comparing different replay detection methods, the authors use a sequence-independent log-odds difference score as a validation tool and an indirect measure of replay quality. This takes advantage of the two-track design of the experimental data, and its use here relies on the assumption that a true replay event would be associated with good (discriminable) reactivation of the environment that is being replayed. The other way replay "quality" is estimated is by the number of replay events detected once the false positive rate is taken into account. In this scheme, "better" replay is in the top right corner of Figure 6B: many detected events associated with congruent reactivation.

      There are two possible ways the results from this study can be integrated into future replay research. The first, simpler, way is to take note of the empirically estimated false positive rates reported here and simply avoid the methods that result in high false positive rates (weighted correlation with a place bin shuffle or all-spike Spearman correlation with a spike-id shuffle). The second, perhaps more desirable, way is to integrate the practice of estimating the false positive rate when scoring replay and to take it into account. This is very powerful as it can be applied to any replay method with any choice of parameters and get an accurate estimate of replay.

      How does one estimate the false positive rate in their dataset? The authors propose to use a cell-ID shuffle, which preserves all the firing statistics of replay events (bursts of spikes by the same cell, multi-unit fluctuations, etc.) but randomly swaps the cells' place fields, and to repeat the replay detection on this surrogate randomized dataset. Of course, there is no perfect shuffle, and it is possible that a surrogate dataset based on this particular shuffle may result in one underestimating the true false positive rate if different cell types are present (e.g. place field statistics may differ between CA1 and CA3 cells, or deep vs. superficial CA1 cells, or place cells vs. non-place cells if inclusion criteria are not strict). Moreover, it is crucial that this validation shuffle be independent of any shuffling procedure used to determine replay itself (which may not always be the case, particularly for the pre-decoding place field circular shuffle used by some of the methods here) lest the true false-positive rate be underestimated. Once the false positive rate is estimated, there are different ways one may choose to control for it: adjusting the significance threshold as the current study proposes, or directly comparing the number of events detected in the original vs surrogate data. Either way, with these caveats in mind, controlling for the false positive rate to the best of our ability is a powerful approach that the field should integrate.
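      A cell-ID shuffle of the kind described above can be sketched as a permutation of the cell-to-place-field mapping, leaving the spike trains untouched. The sketch below is an illustration under stated assumptions: place fields are stored in a dictionary keyed by cell id, and the optional groups argument (hypothetical) restricts swaps to cells of the same class (e.g. CA1 vs CA3), one way to address the cell-type caveat raised above. The function name is likewise hypothetical.

```python
import random


def cell_id_shuffle(place_fields, groups=None, seed=None):
    """Randomly reassign place fields across cells.

    place_fields: dict mapping cell id -> place field (e.g. a tuple of
    firing rates per position bin). groups: optional dict mapping cell id
    -> class label; if given, fields are only swapped within a class.
    Decoding the original spikes with the returned fields gives a surrogate
    dataset that keeps the spiking statistics but breaks place coding.
    """
    rng = random.Random(seed)
    # Bucket cells so that swaps only happen among comparable cells.
    buckets = {}
    for cell in place_fields:
        key = groups[cell] if groups is not None else None
        buckets.setdefault(key, []).append(cell)
    shuffled = {}
    for cells in buckets.values():
        fields = [place_fields[c] for c in cells]
        rng.shuffle(fields)
        shuffled.update(zip(cells, fields))
    return shuffled
```

      Replay detection would then be rerun on the surrogate data with the same pipeline, and the fraction of events passing detection gives the empirical false positive rate.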

      Which replay detection method performed the best? If one does not control for varying false positive rates, two methods resulted in strikingly high (>15%) false positive rates: weighted correlation with a place bin shuffle and Spearman correlation (using all spikes) with a spike-id shuffle. However, after controlling for the false positive rate (Figure 6B), all methods largely agree, including those with initially high false positive rates. There is no clear "winner" method, because there is a lot of overlap in the confidence intervals, and there are also additional reasons for not overinterpreting small differences between methods. The confidence intervals are likely to underestimate the true variance in the data because the resampling procedure does not involve hierarchical statistics and thus fails to account for statistical dependencies at the session and animal level. Moreover, methods that involve shuffles similar to the cross-validation shuffle ("wcorr 2 shuffles" and "wcorr 3 shuffles" both use a pre-decoding place field circular shuffle, which is very similar to the pre-decoding place field swap used in the cross-validation procedure to estimate the false positive rate) may underestimate the false positive rate and therefore inflate the adjusted p-value and the proportion of significant events. We should therefore not interpret small differences in the measured values between methods; the only clear winner, and the best way to score replay, is to use any method after taking the empirically estimated false positive rate into account.
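      For readers unfamiliar with the "wcorr" score, a weighted correlation between time and decoded position, with each (time bin, position bin) pair weighted by its posterior probability, can be sketched as below. This is a generic illustration of the weighted correlation sequence score, assuming a posterior given as a list of time bins by position bins; it does not include any of the shuffle procedures compared in the manuscript, and the function name is hypothetical.

```python
import math


def weighted_correlation(posterior):
    """Posterior-weighted Pearson correlation between time bin and position bin.

    posterior[t][x] is the decoded probability of position bin x in time
    bin t. Returns a value in [-1, 1]; undefined (ZeroDivisionError) if all
    weight sits in a single time bin or a single position bin.
    """
    total = mean_t = mean_x = 0.0
    for t, row in enumerate(posterior):
        for x, w in enumerate(row):
            total += w
            mean_t += w * t
            mean_x += w * x
    mean_t /= total
    mean_x /= total
    # Weighted (co)variances; the common 1/total factor cancels in the ratio.
    var_t = var_x = cov_tx = 0.0
    for t, row in enumerate(posterior):
        for x, w in enumerate(row):
            var_t += w * (t - mean_t) ** 2
            var_x += w * (x - mean_x) ** 2
            cov_tx += w * (t - mean_t) * (x - mean_x)
    return cov_tx / math.sqrt(var_t * var_x)
```

      A perfectly diagonal posterior scores +1 (forward sequence), an anti-diagonal one scores -1 (reverse sequence), and a uniform posterior scores 0; significance would still come from comparing the observed score against a shuffle distribution.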

      The authors recommend excluding low-ripple-power events in sleep, because no replay was observed in events with low (0-3 z-units) ripple power specifically in sleep, and they conclude that no ripple restriction is necessary for awake events. There are problems with this conclusion. First, ripple power is not the only way to detect sharp-wave ripples (the sharp wave is very informative for detecting awake events). Second, when assessing sequence quality in awake non-ripple data, it is imperative to exclude theta sequences, and the authors' speed threshold of 5 cm/s is not sufficient to guarantee that no theta cycles contaminate the awake replay events. Third, a direct comparison of the results with and without exclusion is lacking (selecting for the lower ripple power events is not the same as not having a threshold), so it is unclear how crucial it is to exclude the minority of sleep events outside of ripples. The decision of whether or not to select for ripples should depend on the particular study and on the experimental conditions that can affect this measure (electrode placement, brain state prevalence, noise levels, etc.).

      Finally, the authors address the controversial topic of de-novo preplay. With replay detection corrected for the false positive rate, none of the detection methods produce evidence of preplay sequences or sequenceless reactivation in the tested dataset. This presents compelling evidence in favour of the view that the sequence of place fields formed on a novel track cannot be predicted by the sequential structure found in pre-task sleep.

      We would like to thank the reviewer for the positive and constructive feedback.

      We agree with the reviewer that the conclusion about the effect of ripple power is dataset-specific and is not intended to be a one-size-fits-all recommendation for wider application. But it does raise a concern that individual studies should address. The criteria used for selecting candidate events will affect the overall fraction of detected events and make comparison between studies using different methods more difficult. We have updated the manuscript to emphasize this point.

      “These results emphasize that a ripple power threshold is not necessary for RUN replay events in our dataset but may still be beneficial, as long as it does not eliminate too many good replay events with low ripple power. In other words, depending on the experimental design, a stricter p-value with no ripple threshold may detect more replay events than a less strict p-value combined with a strict ripple power threshold. However, for POST replay events, a threshold at least in the range of a z-score of 3-5 is recommended based on our dataset, to reduce the inclusion of false positives within the pool of detected replay events.”

      “We make six key observations: 1) A ripple power threshold may be more important for replay events during POST compared to RUN. For our dataset, the POST replay events with ripple power below a z-score of 3-5 were indistinguishable from spurious events. While the exact ripple z-score threshold to implement may differ depending on the experimental condition (e.g. electrode placement, behavioural paradigm, noise level, etc.) and experimental aim, our findings highlight the benefit of using a ripple power threshold for detecting replay during POST. 2) ”
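      The epoch-dependent selection rule described in these revisions can be sketched as follows. The event fields are hypothetical. The z = 3 cut-off for POST is just the lower end of the 3-5 range quoted above, and RUN events carry no ripple threshold. Each study would need to set these values for its own recording conditions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass
class CandidateEvent:
    epoch: str        # behavioural epoch label, e.g. "RUN" or "POST"
    ripple_z: float   # peak ripple-band power of the event, as a z-score
    p_value: float    # replay significance from the chosen detection method


# Hypothetical settings: no ripple threshold for RUN, z >= 3 for POST.
DEFAULT_RIPPLE_Z: Dict[str, Optional[float]] = {"RUN": None, "POST": 3.0}


def select_replay_events(events: Sequence[CandidateEvent],
                         alpha: float,
                         ripple_z_min: Dict[str, Optional[float]] = DEFAULT_RIPPLE_Z,
                         ) -> List[CandidateEvent]:
    """Keep events that pass the epoch-specific ripple threshold (if any) and p < alpha."""
    kept = []
    for ev in events:
        z_min = ripple_z_min.get(ev.epoch)  # None means no ripple threshold for this epoch
        if z_min is not None and ev.ripple_z < z_min:
            continue
        if ev.p_value < alpha:
            kept.append(ev)
    return kept
```

      With this structure, changing the threshold for one epoch or dropping it altogether only requires editing the dictionary. That also makes it straightforward to run the reviewer's suggested comparison of results with and without the threshold.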

    1. Author Response:

      Evaluation Summary:

      This manuscript addresses a phenomenon of great interest to researchers in cell metabolism and cancer biology: namely, why do cancer cells often secrete high levels of lactate, despite the presence of abundant oxygen to power nutrient oxidation (Warburg effect)? The authors propose that lactate export and subsequent extracellular acidification provide a selective advantage, and that the concomitant rise in intracellular pH is sufficient to drive flux through glycolysis, thereby sustaining the Warburg effect. This is an intriguing hypothesis that ties together many published observations, but it requires further support on both the technical and the conceptual side.

      The concept proposed in the evaluation summary is not quite correct. In this paper, we have tried to show that it is not lactate export that drives extracellular acidification. Rather, cells that can increase proton export, via over-expression or increased activity of proton-exporting proteins, can subsequently upregulate glycolysis and increase lactate production, likely due to increased intracellular pH (pHi) and the enhanced activity of glycolytic enzymes at slightly higher pHi. As mentioned in the summary, some of these observations are known, but they have not previously been directly tested by inducing acid export prior to a glycolytic phenotype. We believe that showing the causal role of proton export in glycolysis is the novelty of this research.

      Reviewer #1 (Public Review):

      In this manuscript, the authors tackle an interesting puzzle: why do cancer cells secrete most of their glucose as lactate? The authors propose that acid export is sufficient to enhance glycolysis and provide a selective advantage to cancer cells growing in vivo. To this end, the authors show that clonal lines expressing CA-IX or PMA1, each of which facilitates proton export, have an elevated capacity to acidify extracellular medium and can drive increased migration/invasion and tumor growth or metastases. In support of the model that extracellular pH is a key driver of metastases, the effect of CA-IX expression on lung metastases is reversed following bicarbonate treatment. Many of the individual conclusions of the manuscript are not novel (for example, pH has been reported to control glycolysis, and it is established that CA-IX expression modulates migration/metastases). Nevertheless, a comprehensive assessment of the ability of proton export to drive the Warburg effect, together with an assessment of the significance of metabolic rewiring driven by acid export on tumor growth, would represent an important resource for researchers intrigued by the pervasive observation that cancer cells secrete lactate despite the potential bioenergetic disadvantages of discarding biomass.

      The strength of the manuscript therefore lies in tying these disparate observations together into a coherent model and testing the role of acid export per se on glycolytic flux. However, the technical weaknesses of the paper prevent such coherent model building. A major concern is that all cell lines appear to have been generated by transient transfection followed by clonal selection, giving rise to cells with notable variability and inconsistent phenotypes. More traditional approaches to manipulating enzyme expression would provide more robust model systems to test the proposed model. Similarly, direct measures of glycolytic flux are required to draw conclusions about the role of acid export in promoting glycolysis. Another strength is the use of heterologous enzyme systems to alter proton export in cancer cells, but alternative explanations for these results are not fully considered. Ultimately, the paper does not address to what extent acid export per se, as opposed to altered metabolism driven by acid export, drives enhanced tumor metastases.

      We wholly agree with Reviewer 1 that, although individual components of this manuscript have previously been implicated in cancer research, the novelty lies in directly assessing metabolic changes, specifically the Warburg effect, as a result of proton production. This establishes causality rather than the correlation shown by previous studies. The reviewer makes a valid point about our use of clones, and this is something we considered at length. When originally designing these experiments, we had many conversations within our lab and with collaborators and colleagues. The overall consensus was that bulk populations are more likely to have heterogeneous expression levels unrelated to transfection, which could make the resulting phenotype noisy and not indicative of what occurs when proton exporters are over-expressed. We chose to isolate single clones, maintaining these in antibiotic selection media, to ensure stable over-expression. After confirming over-expression, cells were grown without antibiotics and screened regularly for maintenance of protein expression. This was also one of the reasons why we over-expressed two different proton exporters in multiple different cell lines: to be confident that proton export was changing the metabolic phenotype, rather than the change being due to an individual isolated clonal line. We used a bulk population for the MOCK controls, to ensure we weren’t selecting for a clone with inherently different metabolic traits from the parental population. As described in the paper, some behaviors of the different clones are indeed divergent. However, the effect of expression on increased glucose uptake and lactate production is wholly consistent and highly correlated with expression of PMA1 or CA-IX. Although we used metabolic profiling, we do not claim to infer flux from these data; flux was assessed via lactate production and glucose consumption rates.

      The metabolomic analyses showed that glycolytic intermediates upstream of pyruvate kinase (PK) were uniformly increased in transfectants. This was an unequivocal finding and, given the increased flux, we concluded that transfection results in activation of glycolytic enzymes upstream of PK. The pleiotropic nature of these effects led us to propose that intracellular pH was increasing and likely enhancing glycolytic enzyme activity throughout the glycolytic pathway. We measured intracellular pH and showed that it was generally elevated in the transfectants. Finally, the reviewer was concerned that we did not address the mechanism by which pH increases metastases. Such a study would be beyond the scope of this paper and, indeed, was the subject of a two-volume special issue of Cancer Metastasis Rev. in 2019 (PMC6625888). Hence, in this paper, we were not trying to address the mechanism by which pH affects metastasis, but simply wanted to show additional biological relevance.

      Reviewer #2 (Public Review):

      The work by Xu et al. proposes that the Warburg effect (the increase in glycolytic metabolism usually displayed by tumor cells) is driven by increased proton excretion rather than by oncogenic dysregulation of glycolytic enzyme levels. As a proof of principle, they engineered tumor cells to increase proton excretion. They observed an increase in glycolytic rate, pH, and malignancy in their engineered cells.

      1. My main issue with this work is that I do not agree with the authors when they say that the "canonical view" is that oncogenic mutations drive the Warburg effect. As I understand it, the consensus is that fast-proliferating cells, rather than malignant cells specifically, are the ones that display this form of metabolism. The rationale is that glycolytic metabolism allows cells to retain biomass by redirecting carbon (for example, through the pentose phosphate pathway). In contrast, the end product of oxidative phosphorylation is CO2, which cannot be further used in cell metabolism.

      They claim that Vander Heiden et al., 2009 shows that "fermentation under aerobic conditions is energetically unfavorable and does not confer any clear evolutionary benefits." This is incorrect. While that review states that the Warburg effect has little effect on the ATP/ADP ratio, it does show that this form of metabolism has significant benefits for fast-proliferating cells. In fact, the whole review is about how the Warburg effect is a necessary metabolic adaptation for fast proliferation rather than a unique feature of malignant cells.

      2. Their main observation is not surprising. From a biochemical standpoint, protons are a final product of glycolysis (from the production of lactic acid). Thus, by mass action, any mechanism that removes protons from the cell will accelerate the glycolytic rate. Similarly, reducing intracellular pH will necessarily slow down LDHA activity, which in turn will slow down pyruvate kinase, and so on.

      3. Their experiments are conducted on transformed cells, which by definition have oncogenic driver mutations. They should test the effect of proton exporters using primary non-transformed cells (fresh MEFs, immune cells, etc.). I would expect them to still see the increase in glycolysis in this case. Even so, the concerns I expressed in my previous point would still apply.

      4. The fact that they can accelerate the Warburg effect by increasing proton export does not mean that this is the mechanism used by tumor cells in patients, or "the driver" of this effect. As I mentioned, their observation is expected by mass action, but tumors that do not overexpress proton transporters may still drive their Warburg effect via oncogenic mutations. The biochemical need here is to increase the sources of biomass and redox potential, and evolution will select for more glycolytic phenotypes.

      Comment 1: We disagree with the reviewer that the energetic demands of a faster-proliferating cell drive glycolysis in order to produce the biomass needed to generate new cells. Available evidence does not support this hypothesis. As the reviewer mentioned, there is a correlation between proliferation and aerobic glycolysis (i.e. if cells are stimulated to grow, they will consume more glucose), and the same can be said for motility (i.e. more motile cells have higher aerobic glycolysis). This is also true for normal cells and tissues that exhibit high levels of aerobic glycolysis. We agree that glycolytic ATP generation is more rapid than oxidative phosphorylation and that this may confer some selective advantage for transporters, as we described in PMC4060846. Nonetheless, it is clear that under conditions of similar proliferation and motility, more aggressive cancer cells ferment glucose at much higher rates. Neither of these correlations (with proliferation or with motility) is the “Warburg Effect”, which is a higher rate of aerobic glycolysis in cancers, regardless of proliferation or migration. As we described in PMID 18523064, the prevailing view in the cancer literature is that the Warburg effect is driven by oncogenes (ras, myc), transcription factors (HIF) and tumor suppressors (p53/TIGAR) through increased expression of glycolytic enzymes. This assumes that expression levels drive flux, which has not been proven empirically. In biochemical pathways, it is canon that flux is regulated by demand (e.g. ATP) or through some post-transcriptional control (e.g. pH). Vander Heiden’s paper reports steady-state ATP/ADP ratios, not flux. The first paragraph of the introduction has been modified to accommodate this concern.

      Comment 2: The fact that our results are not surprising is our major argument, i.e. that glycolytic flux can be enhanced by increasing the rate of H+ export. We saw an increase in intracellular pH (pHi), but our metabolomics data do not support a direct effect on LDHA or PK. Instead, we show that clones with higher pHi have a crossover point at PK, due to reduced inhibition of upstream enzymes. This crossover is absent in clones with lower pHi.
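      A crossover point of the kind mentioned here can be located mechanically from metabolite fold changes (transfectant vs control) along an ordered pathway. The code looks for a step whose substrate is elevated while its product is not. The sketch below treats lower glycolysis as a simple linear chain with hypothetical metabolite labels. It only finds the sign change. As in classical crossover analysis, interpreting a crossover as activation or relief of inhibition also requires the direction of the flux change, which the code does not model.

```python
from typing import Dict, List, Sequence


def crossover_steps(metabolites: Sequence[str],
                    enzymes: Sequence[str],
                    fold_change: Dict[str, float]) -> List[str]:
    """Enzymes whose substrate has fold change > 1 and whose product has fold change <= 1."""
    if len(enzymes) != len(metabolites) - 1:
        raise ValueError("need exactly one enzyme between each pair of consecutive metabolites")
    hits = []
    for i, enzyme in enumerate(enzymes):
        upstream, downstream = metabolites[i], metabolites[i + 1]
        if fold_change[upstream] > 1.0 and fold_change[downstream] <= 1.0:
            hits.append(enzyme)
    return hits


# Simplified linear chain for lower glycolysis (labels are illustrative assumptions).
LOWER_GLYCOLYSIS = ["3PG", "2PG", "PEP", "pyruvate", "lactate"]
LOWER_ENZYMES = ["PGAM", "ENO", "PK", "LDH"]
```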

      Comment 3: We agree that it would be interesting to study the effects of proton export on immune cells, especially given the increasing use of immunotherapy in cancer treatment. We did use HEK 293 cells (Supplementary Figure S6) to show that this was not a cancer-cell-line-specific phenomenon, and we saw increased aerobic glycolysis with over-expression of CA-IX.

      Comment 4: We agree that oncogenic mutations can alter glycolytic rate, but we observed that increased expression and activity of proton exporters is sufficient to drive a Warburg effect. The reviewer indicates that glycolysis is responsible for generating the biomass needed by faster-proliferating cells. However, we have shown that proton-exporter-driven aerobic glycolysis does not increase proliferation rates. The literature (see the Vander Heiden paper below) suggests that amino acids, mainly glutamine, can support the majority of the biomass needs of a proliferating cell. Hence, reliance on aerobic glycolysis remains inefficient, both energetically and because most of the glucose carbon is lost, and thus would not be selected by evolution.

      Hosios, A.M., Hecht, V.C., Danai, L.V., Johnson, M.O., Rathmell, J.C., Steinhauser, M.L., Manalis, S.R., & Vander Heiden, M.G. (2016). Amino Acids Rather than Glucose Account for the Majority of Cell Mass in Proliferating Mammalian Cells. Developmental Cell, 36(5), 540-549.

      Reviewer #3 (Public Review):

      The authors claim that "proton export drives the Warburg effect". To test this, they expressed proton-exporting proteins in cells and measured the intracellular proton concentration and the Warburg effect. Based on their data, however, I do not see an elevated Warburg effect in these cells and thus conclude that the claim is not supported.

      The authors concluded that the CA-IX or PMA1 expressing cells had an increased Warburg effect. I don't think this conclusion can be drawn from the data presented. For the MCF-7 cells, glucose consumption is ~18 pmol/cell/24hr (Fig. 5E) and lactate production is ~0.6 pmol/cell/24hr (Fig. 5F), indicating that 0.6/18/2 = 1.7% of the glucose is excreted as lactate. This low percentage remains true for the PMA1 expressing cells. For example, for the PMA1-C5 cells, the percentage of glucose going to lactate is about 1.8/38/2 = 2.4% (Fig. 5E, F). While there was indeed an increase in both the glucose and lactate fluxes in the PMA1 expressing cells, the vast majority of the glucose flux ends up elsewhere, likely in the TCA cycle. This is a very different phenotype from cancer cells that have a Warburg effect. The same calculation can be done for the CA-IX cells, but the glucose and lactate concentration data there are inconsistent and expressed in confusing units (which I will elaborate on in the next paragraph). Nevertheless, as there was at most a few-fold increase in lactate production flux in the M1 and M6 cells, the glucose flux going to lactate production is likely also a few percent of the total glucose uptake flux. Again, these cells do not really have a Warburg effect.
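      The reviewer's arithmetic rests on the stoichiometry of one glucose yielding at most two lactate. The fraction of consumed glucose excreted as lactate is therefore lactate / (2 × glucose), with both rates in the same molar units. A minimal sketch, using the values quoted by the reviewer (read from the originally submitted Fig. 5E/F, before the dilution correction described in the response below):

```python
def fraction_glucose_to_lactate(lactate_production: float,
                                glucose_consumption: float) -> float:
    """Fraction of consumed glucose excreted as lactate (1 glucose -> at most 2 lactate).

    Both rates must be in the same molar units, e.g. pmol/cell/24 h.
    """
    if glucose_consumption <= 0:
        raise ValueError("glucose consumption must be positive")
    return lactate_production / (2.0 * glucose_consumption)


print(f"{fraction_glucose_to_lactate(0.6, 18):.1%}")  # MCF-7 control: 1.7%
print(f"{fraction_glucose_to_lactate(1.8, 38):.1%}")  # PMA1-C5: 2.4%
```

      A fraction close to 1 would mean nearly all consumed glucose leaves the cell as lactate. Values of a few percent leave most glucose carbon unaccounted for by lactate, which is the basis of the reviewer's objection.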

      The glucose and lactate concentration data are key to the study. However, the data appear to lack consistency. The lactate concentration data in Fig. 1F show a ~5-fold increase in the M1 and M6 cells relative to the controls, but the same data in S. Fig. 2 show a mere ~50% increase. The meaning of the units in these figures is not clear. While "1 ng/ug protein" means that 1 ng of lactate is produced by 1 ug protein of cells over a 24 hour period, I do not understand what "ng/ul/ug protein" means (Fig. 1F). Also, "g/L/cell" must be a typo (S. Fig. 2). Furthermore, regarding the important glucose consumption flux, it is not clear why the authors did not measure it directly, as they did for the PMA1 cells (Fig. 5E). Instead, they showed two indirect measurements that are not consistent with each other (Fig. 1E and S. Fig. 1).

      The reviewer pointed out discrepancies in our data and, upon review, we identified a dilution error that led to a miscalculation of glucose consumption in Fig. 5E. We have also repeated these experiments, and the new results agree with our re-calculation. The data we originally presented suggested very little lactate flux. We have now re-calculated the fraction of glucose excreted as lactate (average %, using data from Fig. 5E and 5F) and present it in a table below. We do believe we consistently observed a Warburg effect in our proton-exporting cells. The reviewer points out that our use of multiple methods to measure glycolysis in these cells led to inconsistency. However, we felt that using multiple methods, instruments and kits to assess glucose consumption, lactate production, and glucose-induced proton production rates was a strength of our findings, as we consistently saw increased glycolysis in our proton-exporting clones, irrespective of proton exporter, cell line, or method used. We are also not suggesting that glucose is metabolized solely through glycolysis. We agree that it can also be metabolized through other pathways, such as the TCA cycle, as the reviewer stated. The units used for these graphs are described in the methods and figure legends. In some assays, such as Fig. 1F, lactate was graphed as ng of lactate per ul of cell culture media and then normalized per ug protein, which was determined from the protein concentration of cells in each well of the assay. Supplementary Figure 2 has been re-plotted per 10K cells to match other normalization values in the paper. Fig. 1E and Fig. S1 show two different time points. M6 acidified media faster than M1, which is likely why at 1 hour we do not yet see a substantial increase in glucose uptake by M1.

    1. Author Response

      Reviewer #1 (Public Review):

      In the manuscript "Dnmt3a knockout impairs synapse maturation and is partly compensated by repressive modification H3K27me3," Li et al. investigate the role of Dnmt3a in the development of mouse cortical neurons by conditionally knocking it out during mid-late gestation and measuring the resulting molecular and phenotypic consequences. The study provides temporal context for Dnmt3a-dependent DNA methylation in the development of a specific population of neurons and describes a potentially novel mode of compensatory histone trimethylation at H3K27 at particular genomic loci that lose DNA methylation. The authors first describe phenotypic aberrations induced by Dnmt3a cKO, including altered dendrite/spine morphology and deficits in particular social behaviors, without overt morphological alterations in the brain. They then go on to describe the epigenomic landscape underlying their observations.

      While the study includes novel, high-quality data, there are a few caveats that need to be addressed. For example, while the manuscript does provide evidence suggesting that some regions of the genome may be compensated by H3K27me3, the biological basis for this remains unclear, as do the consequences of this compensation. The behavioral data, while providing a phenotype for the regulatory role of Dnmt3a in neuronal structure and function, are not related in any particular way to the sequencing data. Overall, the paper presents chromatin information within a more limited biological context.

      We thank the reviewer for appreciating the novelty and quality of our data. While we agree that key questions concerning the biological mechanism and significance of increased H3K27me3 remain, our study sets the stage for such investigation by providing a valid mouse model for excitatory neuron-specific loss of postnatal DNA methylation. Likewise, the behavioral studies we report do not exhaustively define the functional consequences of loss of Dnmt3a in pyramidal neurons, but they provide a foundation by defining the broad cognitive domains (working memory, social interaction) that are affected. Importantly, our behavioral studies also established that many key cognitive functions (e.g. learning and memory) are largely preserved despite the massive disruption in epigenetic regulation of a large and critical population of cortical excitatory neurons. These mild behavioral deficits, together with the restricted transcriptional changes, point to a compensatory mechanism being turned on after the loss of Dnmt3a, which we proposed was due to H3K27me3 expansion.

      Reviewer #2 (Public Review):

      In this study, Li, Pinto-Duarte and colleagues investigate functional and epigenomic effects of loss of DNMT3A in excitatory neurons using a conditional knockout mouse model. The authors characterize behavioral, cell-morphological, and electrophysiological deficits suggesting that disruption of synapse function may be a major driver of phenotypes in these mice. Through RNA-seq analysis of mutant neurons, they identify 1720 dysregulated genes, some of which are implicated in dendritic and axonal development and synaptic formation. To understand the epigenetic factors underlying transcriptomic effects, the authors perform MethylC-seq. They observe widespread reductions of mCG and mCH in mutant excitatory neurons and detect 141,633 differentially CG methylated regions (DMRs), which exhibit large reductions in mCG. To understand why sets of genes with widespread methylation depletion could be either up- or downregulated, the authors profiled histone modifications. They observe changes in H3K27me3 signal over development and increases in this mark at DMRs upon loss of DNMT3A. They suggest that over-compensation by H3K27me3 repression at genes containing DMRs may drive some of the downregulation of gene expression observed in DNMT3A mutant mice. These results confirm findings from previous publications on loss of DNA methylation in DNMT3A conditional mutant mice and identify novel alterations in H3K27me3 that may affect gene expression in these mutants.

      Understanding functional outcomes of DNMT3A loss and identifying mechanistic interplay between neuronal DNA methylation and other epigenetic mechanisms is of significant interest to the field. It has been clear that DNMT3A is critical to neuronal development, but cellular characterization such as spine morphology and synapse function has been limited. The analyses presented here provide robust evidence for synaptic alterations upon loss of DNMT3A. The authors' characterization of the differences in H3K27me3 across development and in the DNMT3A cKO underscores the potential importance of this mark when DNA methylation is altered.

      We would like to thank the Reviewer for their thoughtful assessment of the significance of our data and findings.

      While changes in H3K27me3 are relevant and are likely to be functionally important, the study has some limitations in assessing the magnitude and impact of these changes:

      1. Only two biological replicates per condition are included in most genomic analysis. This may lead to over-estimates of the changes observed due to sample-specific technical variation in the ChIP and sequencing procedures, particularly given the subtle alterations that are identified.

      We appreciate the reviewer’s concern regarding the number of biological replicates, which are critical for ensuring the reproducibility of our findings in independent animals. To reduce variability due to individual differences, the majority of our sequencing data come from tissue samples pooled from two mice. The only exception is MethylC-seq data from P0 mice, where we have 6 control and 2 cKO samples that each came from one individual. This information is now included in the “num_pooled_animals” column in Supplementary Table 1. We have added additional analyses showing the strong consistency of our results across biological replicates for RNA-seq (Figure S8A), MethylC-seq (Figure S10A), and ChIP-seq (Figure S19).

      In addition, the current resubmission includes new datasets from two new replicates for both RNA-seq and MethylC-seq. These data are highly consistent with the previous findings. For example, of the 70 genes found to be differentially expressed (FDR < 0.05) in our new batch of RNA-seq data, 53 (75.7%) showed the same direction of expression change (up- or down-regulation) in the previous batch (Fig. R1):

      Fig. R1: Scatter plots showing the consistency of gene expression fold changes (Dnmt3a cKO vs. control) across the two batches of RNA-seq samples, using significant DE genes detected in the new batch (left) and significant DE genes detected in the old batch (right).
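      The batch-consistency figure quoted above (53 of 70 DE genes changing in the same direction) is a sign-concordance count. A minimal sketch follows, assuming per-gene log2 fold changes (cKO vs control) stored in dictionaries keyed by gene name. Genes missing from either batch are skipped. This illustrates the calculation and is not the authors' code.

```python
from typing import Dict, Iterable, Tuple


def _sign(x: float) -> int:
    return (x > 0) - (x < 0)


def direction_concordance(de_genes: Iterable[str],
                          lfc_new: Dict[str, float],
                          lfc_old: Dict[str, float]) -> Tuple[int, int, float]:
    """Count DE genes whose log2 fold change has the same sign in both batches.

    Returns (concordant, compared, fraction); genes absent from either batch are skipped.
    """
    shared = [g for g in de_genes if g in lfc_new and g in lfc_old]
    same = sum(_sign(lfc_new[g]) == _sign(lfc_old[g]) for g in shared)
    return same, len(shared), (same / len(shared) if shared else float("nan"))
```

      With the numbers reported in the response, the fraction is 53 / 70, which rounds to 75.7%. Running the function in both directions, seeding it with DE genes from the new batch and then from the old batch, mirrors the two panels of Fig. R1.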

      2. While the compensatory mechanism proposed is feasible in light of the findings presented, evidence definitively supporting H3K27me3 changes as truly compensatory for loss of mCG in DNMT3A conditional knockout neurons is limited. Additional genomic analyses or experimental evidence would be needed to authoritatively make this claim.

      We agree that definitively establishing a causal role for the histone methylation changes in compensating for the loss of DNMT3A would require additional experiments, such as manipulation of histone methyltransferases. Such experiments are beyond the scope of this study. We have revised the manuscript to acknowledge this limitation and more clearly state the nature of our conclusions:

      "Overall, our results suggest that when DNA methylation is disrupted, H3K27me3 might partially compensate for the loss of mCG and/or mCH and act as an alternative mode of epigenetic repression. Nevertheless, we did not find differential expression in any of the four core components of PRC2 (Ezh2, Suz12, Eed and Rbbp4) in adult Dnmt3a cKO animals. It is possible that the increased H3K27me3 was mediated by transient expression of PRC2 components during development in the cKO. Furthermore, the predictions from BART (Figure 4A) were derived from various cell lines and tissues from the ENCODE project (Davis et al., 2018; ENCODE Project Consortium, 2012), suggesting that the potential PRC2 binding at our DEGs may normally happen in systems other than the brain or pyramidal neurons, or at other time points during development. Additional experiments which directly manipulate components of the PRC2 system are required to further test the potential compensation mechanism."

      3. The study includes limited analyses assessing how changes in mCH and H3K27ac, two other epigenetic marks shown to be disrupted in DNMT3A models, are integrated with changes in H3K27me3, mCG and gene expression.

      We found an increase in H3K27ac, specifically at DMRs which lose mCG in the cKO (shown in Figure 5C). This was an expected finding reflecting the epigenetic activation of enhancer regions that fail to gain DNA methylation.

      Regarding mCH, our study was originally motivated by our interest in the role of mCH in neural development, and we were very interested in exploring this question. The complete loss of mCH is indeed a very dramatic effect of the cKO (Figure 3C), and this genome-wide disruption of the normal DNA methylation pattern might have been expected to severely impact neural function. Instead, our data showed relatively limited alterations in neural gene expression, as well as synaptic physiology and social behavior. Thus, although we did analyze the link between mCH and gene expression (e.g. in Figure 3D-E), we found that the loss of mCH could explain only a very small fraction (0.456%) of the differential expression (Supplementary Figure S11D). By contrast, mCG changes occur in a localized fashion specifically in regions that are developmentally regulated and gaining mCG via Dnmt3a during postnatal development. Because we found a clear association between these mCG differences and H3K27me3, we performed a more in-depth analysis on those marks.

      Overall, the study has generated valuable datasets that identify cellular phenotypes and suggest a novel disruption of H3K27me3 in DNMT3A conditional knockout mice. However, the conclusions regarding the importance of H3K27me3 in compensation in these mutant mice are quite speculative.

    1. Author Response

      Reviewer #1 (Public Review):

      1) The authors present an interesting proposal for how the generative model operates when producing shapes in Fig 6, as well as some alternative strategies in Fig 7. It is not clear what evidence supports the idea that shapes are first broken down into parts, then modified and recombined. It is obvious from the data that distinctive features are preserved (in some cases), but some clarification on the rest would be useful. For instance, is it possible that conjunctions or combinations of features are processed in concert? What determines whether critical features are added or subtracted to the shape during generation? Some more justification for this proposed model is needed, as well as for how the exceptions and alternate strategies were determined.

      In line with recent eLife policy, we have moved our discussion of how new shapes might be produced into a new subsection called ‘ideas and speculation’ to emphasise that this is a speculative proposal that goes beyond the data, rather than a straightforward report of findings per se. Such speculations are actively encouraged if appropriately flagged (see https://elifesciences.org/inside-elife/e3e52a93/elife-latest-including-ideas-and-speculation-in-elife-papers). In places, we have also reworded the description to make it clearer that our proposals are based on a qualitative assessment of the data (looking at the shapes and trying to verbalize what seemed to be going on) rather than a formal quantitative analysis.

      However, our proposal is also compatible with some analyses of our data. We have added a new analysis to Experiment 4 to test whether part order has been retained or changed between Exemplars and Variations. This analysis allows us to quantify our previous observations of different strategies (cf. Fig. 7). For example, we show that, relative to the Exemplar, some drawings shuffled the order of parts, omitted parts or added parts, all of which point to a part-based recombination approach. However, we have also qualified our discussion to clarify that part-based recombination is not the only possible strategy. We have also added the reviewer’s observation that multiple parts are sometimes retained or modified in conjunction with one another.

      2) Some claims are made in the manuscript about large changes being made to Variations without consequence to effective categorization. However, these appeal to findings derived from collapsing across all Variations, when it could be informative to investigate the edge cases in more detail. There is a broad range in the similarity of Variations to Exemplars, and this could have been profitably considered in some analyses, especially zooming in on the 'Low Similarity' Variations. For example, this would help determine whether classification performance and the confusion matrix change in predictable ways for high-, relative to medium- and low-similarity Variations. It could also indicate whether the features and feature overlap can tell us anything about how likely a Variation is to be perceived as from the correct category.

      To address this point, we have added a new analysis to Experiment 3, which compares classification performance across 4 similarity bins (from low to high similarity). Performance remained high, indeed virtually identical, for the three most similar bins. Only the least similar bin showed slightly reduced performance, and even there the misclassification rate remained low. We now describe this analysis and the results in the text; here, we additionally show the confusion matrices per similarity bin.
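      As an illustration of this kind of analysis, the sketch below splits Variations into equal-count similarity bins and computes the fraction of correct classifications per bin. It is a minimal sketch under assumptions not stated in the response: bins are quantiles of a per-Variation similarity score, each classification is scored as correct or incorrect, and the function name and toy data are hypothetical.

```python
from collections import defaultdict


def accuracy_by_similarity_bin(similarity, correct, n_bins=4):
    """Return the fraction of correct classifications in each of n_bins
    equal-count bins, ordered from lowest to highest similarity."""
    if len(similarity) != len(correct):
        raise ValueError("similarity and correct must have the same length")
    if len(similarity) < n_bins:
        raise ValueError("need at least one item per bin")
    # Rank items by similarity to their Exemplar (lowest first)
    order = sorted(range(len(similarity)), key=lambda i: similarity[i])
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rank, i in enumerate(order):
        b = rank * n_bins // len(order)  # equal-count (quantile) bin index
        totals[b] += 1
        hits[b] += int(bool(correct[i]))
    return [hits[b] / totals[b] for b in range(n_bins)]
```

      For instance, with toy data, `accuracy_by_similarity_bin([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8], [0, 1, 1, 1, 1, 1, 1, 1])` returns `[0.5, 1.0, 1.0, 1.0]`: only the lowest-similarity bin contains a misclassification.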

      3) The authors cross-referenced data from Experiments 4 and 5 to draw the conclusion that the most distinct features are preserved in Variations. This was very compelling and raised the idea that there are further opportunities to perform cross-experiment comparisons to better support the existing claims. For example, perhaps the correspondence percentages in Exp 4, or the 'distinctive feature-ness' in E5, allow prediction of the confusion proportions in Exp 3.

      Thanks for this suggestion. We have added a new subplot to Fig. S2 showing that the average percentage of area decreases with decreasing similarity to the Exemplar. We now also report this result in the text (Experiment 4).

      4) The Variation generation task did not require any explicit discrimination between objects to establish category learning, which is a strength of the work that the authors highlighted. However, it's worth considering that discrimination may have had some lingering impact on Variation generation, given that participants were tasked with generating Variations for multiple exemplars. Specifically, when they are creating Variations for Exemplar B after having created Variations for Exemplar A, are they influenced both by trying to generate something that is very like Exemplar B and by trying to generate something that is decidedly not like Exemplar A? A prediction that logically follows from this would be that there are order effects, such that metrics of feature overlap and confusion across categories decrease for later Exemplars.

      We now discuss potential carry-over effects in Experiment 1, together with how we tried to minimize these effects by randomizing the order of Exemplars per participant. We also added to the discussion section how future studies might use crowd-sourcing with only a single Exemplar to completely eliminate such effects.

      In an additional analysis not reported in the study, we find that the ‘age’ of a drawing (i.e., whether it was drawn earlier or later in the experiment) is not significantly correlated with the percentage of correct categorizations in Experiment 3 (r = -0.04). Although this does not rule out carry-over effects completely, it does suggest that they did not significantly affect categorization decisions.
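      A minimal sketch of such a check, assuming Pearson's correlation between each drawing's position in the session and its percentage of correct categorizations. The response reports r but not the exact coefficient or significance test used, and the function name is hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences,
    e.g. drawing order (x) and percentage of correct categorizations (y)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Co-variation of x and y, and spread of each around its mean
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)
```

      For example, `pearson_r([1, 2, 3], [2, 4, 6])` returns 1.0. In Python 3.10 and later, `statistics.correlation` computes the same coefficient; a value near zero, such as the reported r = -0.04, indicates essentially no linear relationship.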

      Reviewer #2 (Public Review):

      Overall, I find the paper compelling, the experiments methodologically rigorous, and the results clear and impactful. By using naïve online observers, the researchers are able to make compelling arguments about the generalizability of their effects. And, by creative methods such as swapping out the distinctive (vs. less distinctive) features and then testing categorization, they are able to successfully pinpoint some of the determinants of one-shot learning.

      We would like to clarify that all experiments were conducted in person and not over the internet; reviewer #2 refers to “naïve online observers”, but after carefully checking the text we could find no mention of online experiments.

    1. Author Response:

      Reviewer #1:

      This study examines the use of terahertz wave modulation (THM), a technique for transmitting terahertz electromagnetic energy to the cochlea with the aim of improving the sensitivity of cochlear outer hair cells. ABRs obtained with and without THM suggest that sensitivity thresholds were improved by 10 dB when using THM. Whole-cell patch-clamp recordings from outer hair cells suggest that THM significantly increases both K+ and MET currents of cochlear outer hair cells. These results are convincing and potentially important for understanding normal cochlear physiology.

      On the other hand, the numerous claims about translational applicability of this work seem overstated.

      Lines 61-65: This is incorrect. For example, optogenetics and stem cells are not currently regarded as a "treatment for hearing impairment", and in fact the manuscript says as much later in the paragraph. Also, pharmacological treatment is rarely effective, and only in limited circumstances.

      Many thanks to the reviewer for pointing out this mistake. We have replaced the discussion with:

      “At present, treatment for hearing impairment is primarily administered through pharmacological treatment, hearing aid equipment, and electronic cochlear implantation (Wilson et al., 1991; Kipping et al., 2020; Gang et al., 2008). Optogenetics (Huet et al., 2021), stem cell differentiation and transplantation (Oshima et al., 2010; Li et al., 2003; Chen et al., 2012) are also being explored to treat hearing loss. However, pharmacological treatment is rarely effective, and only in limited circumstances.”

      Lines 283-294: The discussion of near-infrared vs THM is misguided. Near-infrared has been proposed as a possible alternative technology to stimulate spiral ganglion neurons, thus replacing cochlear implants. This is plausible, even though feasibility has not yet been demonstrated. In contrast, THM does not seem like a plausible alternative to cochlear implants. Patients who are candidates for cochlear implantation may not have enough (or any) outer hair cells, which are the target for THM.

      We thank the reviewer for pointing out the difference in principle between near-infrared auditory stimulation and THM. We have now modified the main text to compare the differences and similarities between THM and near-infrared stimulation (NIRS). Please see the revised Discussion.

      Lines 295-299: "In comparison with wearing hearing aids, stem cell differentiation and transplantation (Oshima et al., 2010; Li et al., 2003; Chen et al., 2012), optogenetics (Huet et al., 2021) and electronic cochlear implantation (Wilson et al., 1991; Kipping et al., 2020; Gang et al., 2008), THM requires no traumatic surgery, cumbersome equipment, or genetic manipulation, and is thus more suitable for use in human subjects." In the described experiment, optic fibers had to be placed close to outer hair cells. That seems to require "cumbersome equipment" and obviously would require surgery for use in humans.

      Many thanks to the reviewer for pointing out this inappropriate statement. We completely agree and have now revised it in the manuscript.

      The data show that sensitivity was improved by 8.75 dB. In practical terms this is a very small change. Sensitivity improvements of 10 dB (and much more than that) can be obtained non-invasively and on a frequency-dependent basis using traditional amplification.
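      To put this figure in perspective: ABR thresholds are expressed in dB relative to a sound-pressure reference, so a threshold shift of D dB corresponds to a pressure ratio of 10^(D/20) and an intensity ratio of 10^(D/10). The short sketch below (function names are illustrative) shows that an 8.75 dB improvement is roughly a 2.7-fold change in sound pressure, or about 7.5-fold in intensity.

```python
def db_to_pressure_ratio(db: float) -> float:
    """Sound-pressure (amplitude) ratio for a level difference in dB: 10**(dB/20)."""
    return 10 ** (db / 20)


def db_to_intensity_ratio(db: float) -> float:
    """Intensity (power) ratio for a level difference in dB: 10**(dB/10)."""
    return 10 ** (db / 10)


if __name__ == "__main__":
    shift = 8.75  # ABR threshold improvement with THM, in dB
    print(f"pressure ratio:  {db_to_pressure_ratio(shift):.2f}")
    print(f"intensity ratio: {db_to_intensity_ratio(shift):.2f}")
```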

      Any neural stimulation technology would require not only spatial selectivity but also temporal responsiveness. THM could plausibly meet the former criterion, but whether it meets the latter is unknown. In other words, for any practical application it would be necessary to show that modulation of a THM signal can be perceived by listeners. However, this criticism is moot if the claims about clinical applicability of THM are removed.

      We thank the reviewer for these constructive comments. We completely agree, and the claims about the clinical applicability of THM have been removed.

      Reviewer #2:

      This manuscript uses mid-infrared light to enhance the currents from natural stimuli (mechanical and voltage) of hair cells. The authors show increased voltage-gated K+ current and MET currents while being illuminated with mid-infrared light. Based on molecular dynamics simulations, the authors hypothesize that the augmented voltage-gated K+ currents are due to stimulation of C=O groups in the selectivity filter which allows K+ ions to pass through the pore more quickly to increase conductance; there was no hypothesis as to why MET currents were augmented. The authors also demonstrate improved ABR thresholds when the cochlea was illuminated with the mid-infrared light, demonstrating a potential therapeutic application. The enthusiasm for the novelty of this work is reduced because other work has shown that neurons can be excited by near-infrared (~2 microns) wavelength due to thermal stimulation and changes in cell capacitance, so this work mainly differs in their proposed mechanism and the longer wavelength of light (8.6 microns). Additionally, the Hudspeth group (Azimzadeh et al, 2018, PMC5805653) has shown thermal gating of MET channels using ultraviolet light and infrared light (1.47 microns). If the THM mechanism is indeed different from thermal stimulation, this would be a novel therapeutic mode, however, the data are not yet convincing that thermal stimulation is not the mechanism of action.

      We thank the reviewer for these suggestions, which are essential for improving our manuscript, and in particular for pointing out the important literature on thermal gating of MET channels. We have now cited and discussed this paper and other related papers.

      Because the structures of the MET channels have not been resolved, we cannot study the mechanism at the atomic or chemical-bond level with molecular dynamics simulations.

      Infrared stimulation is emerging as an area of interest for neuromodulation and potential clinical application. While most studies on infrared neural stimulation (INS) have been conducted at near-infrared wavelengths, whether mid-infrared wavelengths can affect neuronal function is unknown. Many studies have shown that the threshold for INS-evoked action potentials is correlated with the absorption coefficient of the solution at the stimulation wavelength: the higher the absorption coefficient, the lower the threshold. The mechanism by which INS induces action potentials is therefore generally believed to be the rapid rise in solution temperature caused by INS, namely the “photothermal effect” [1]. However, as Figure R1 shows, the absorption of water at the 8.6 μm wavelength we use is very weak.

      How does near-infrared light affect the excitability of cells or nerves through the “photothermal effect”, promoting or inhibiting the generation or propagation of action potentials in neurons? In other words, what is the target of the “photothermal effect”? Currently, there are few studies on the mechanisms, and the possible biophysical mechanisms include the following three:

      (1) After INS is absorbed by the solution, the solution temperature increases rapidly, changing the membrane capacitance and inducing an inward current, which depolarizes the membrane and generates an action potential [2]; (2) INS activates temperature-sensitive TRP ion channels, which triggers an action potential [3]; (3) INS enhances inhibitory postsynaptic currents by acting on GABA receptors, thus producing an inhibitory effect [4].

      At present, INS mainly uses near-infrared light (1-3 μm), the stimulation parameters used are not consistent, and many factors determine whether INS excites or inhibits (such as fiber diameter, infrared light energy, pulse width and repetition frequency). On the one hand, the photothermal effect is difficult to control: some studies have found that excessive heating blocks the generation and propagation of action potentials, and INS can even cause irreversible inhibition of action potentials and tissue damage [5]. On the other hand, the target of the photothermal action is difficult to determine, which hinders the safe and effective adoption of INS as a neuromodulation tool in clinical or research settings. Therefore, new modulation strategies with more explicit mechanisms are needed in the field of optical neuromodulation.

      References:

      1. Wells, J., Kao, C., Konrad, P., Milner, T., Kim, J., Mahadevan-Jansen, A., Jansen, E.D.: Biophysical mechanisms of transient optical stimulation of peripheral nerve. Biophysical Journal. 93, 2567-2580 (2007).

      2. Shapiro, M.G., Homma, K., Villarreal, S., Richter, C.P., Bezanilla, F.: Infrared light excites cells by changing their electrical capacitance. Nature Communications. 3, (2012).

      3. Albert, E.S., Bec, J.M., Desmadryl, G., Chekroud, K., Travo, C., Gaboyard, S., Bardin, F., Marc, I., Dumas, M., Lenaers, G., Hamel, C., Muller, A., Chabbert, C.: TRPV4 channels mediate the infrared laser-evoked response in sensory neurons. Journal of Neurophysiology. 107, 3227–3234 (2012).

      4. Feng, H.J., Kao, C., Gallagher, M.J., Jansen, E.D., Mahadevan-Jansen, A., Konrad, P.E., Macdonald, R.L.: Alteration of GABAergic neurotransmission by pulsed infrared laser stimulation. Journal of Neuroscience Methods. 192, 110–114 (2010).

      5. Walsh, A.J., Tolstykh, G.P., Martens, S., Ibey, B.L., Beier, H.T.: Action potential block in neurons by infrared light. Neurophotonics. 3, 040501 (2016).

      The authors hypothesize that the increase in K+ current through voltage gated channels is due to increasing the speed of movement of the K+ ions through the selectivity filter, which they modeled with molecular dynamics simulations. However, the simulations are not validated with experimental manipulations.

      We thank the reviewer for pointing this out. As shown in Figure R1, we overlaid the vibration spectra of the modeled channels and the attenuation of infrared light in water.

      Figure R1. Comparison of the absorption intensities of water molecules (green curve), the Na+ channel (orange curve) and the K+ channel (black curve) from our MD simulations, with values from other molecular dynamics calculations [1] (purple star).

      As shown in Figure R1, the K+ channel absorbs THz waves most strongly at 49.86 THz, but this frequency falls in the strong absorption region of water molecules. THz wave modulation (THM) at this frequency would therefore be confounded by the thermal effect caused by strong absorption in water.

      For Na+ channels, the strongest absorption peak is located at 48.20 THz, which is consistent with the calculations reported in [1] (PNAS 118, e2015685118, 2021). However, this frequency also falls in the absorption region of water molecules and would be preferentially absorbed by water. In theory, a frequency of 39.82 THz could avoid water absorption and modulate the carboxyl (-COO-) groups of Na+ channels in a non-thermal way, thus promoting or inhibiting the Na+ current. Unfortunately, these predictions are difficult to confirm experimentally because light sources at this frequency are not yet intense enough: the laser cannot be coupled effectively into an optical fiber and focused onto nerve cells, which prevents ion-channel currents from being measured under terahertz stimulation [2]. We believe that the modulation of Na+ channels by terahertz waves of specific frequencies will be studied further once light sources and fiber-coupling technology at the relevant frequencies are better developed.
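      The frequencies discussed here can be related to wavelengths with λ = c/f. The sketch below (helper names are illustrative) shows that the 8.6 μm stimulation corresponds to about 34.9 THz, while the absorption peaks at 49.86, 48.20 and 39.82 THz correspond to vacuum wavelengths of about 6.01, 6.22 and 7.53 μm.

```python
# Speed of light in vacuum (m/s), exact SI value
C = 299_792_458.0


def thz_to_um(f_thz: float) -> float:
    """Vacuum wavelength in micrometres for a frequency in terahertz."""
    return C / (f_thz * 1e12) * 1e6


def um_to_thz(wavelength_um: float) -> float:
    """Frequency in terahertz for a vacuum wavelength in micrometres."""
    return C / (wavelength_um * 1e-6) / 1e12


if __name__ == "__main__":
    print(f"8.6 um -> {um_to_thz(8.6):.2f} THz")
    for f in (49.86, 48.20, 39.82):
        print(f"{f:.2f} THz -> {thz_to_um(f):.2f} um")
```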

      References:

      1. Xi Liu†, Zhi Qiao†, Yuming Chai†, Zhi Zhu†, Kaijie Wu, Wenliang Ji, Daguang Li, Yujie Xiao, Junlong Li, Lanqun Mao, Chao Chang, Quan Wen, Bo Song, Yousheng Shu, Non-thermal and reversible control of neuronal signaling and behavior by mid-infrared stimulation. Proc. Natl. Acad. Sci. U. S. A. 118 (10): e2015685118, (2021).

      2. Seddon, Angela B. "Mid-infrared (MIR) photonics: MIR passive and active fiberoptics chemical and biomedical, sensing and imaging." Emerging Imaging and Sensing Technologies. International Society for Optics and Photonics, 9992, 999206, (2016).

      It was unclear to this reviewer whether the temperature effect would be measurable with the technique used. It appears that the temperature-measuring system is rather large compared with the cell; it would therefore likely measure changes in bulk solution temperature and not necessarily a local or micro-scale change in temperature that the cell may be responding to. Additionally, Littlefield and Richter have suggested that temperature changes on the order of 0.1 degrees Celsius are sufficient to evoke action potentials (Littlefield & Richter, 2021, PMC8035937), which is well within the temperature changes observed by the authors. At the longer wavelengths used in this study, the absorption of water is generally even higher as well, suggesting even greater temperature changes with the same power. In vestibular hair cells a 10 deg Celsius increase in temperature led to a 50-60% increase in peak MET current (Songer & Eatock, 2013, PMC3857958).

      We thank the reviewer for pointing out this issue. Indeed, the temperature-measuring system is rather large compared with the cell. We performed the temperature measurements with an ADInstruments acquisition system (PowerLab 4/35) coupled to a T-type hypodermic thermocouple (MT 29/5, Physitemp); the diameter of the thermocouple is 100 μm. However, our new in vitro measurements of tissue temperature showed that the maximum temperature elevation was less than 4 °C with 75 mW stimulation, much lower than the 10 °C increase in the paper cited by the reviewer (Songer & Eatock, 2013, PMC3857958). The other paper mentioned by the reviewer (Littlefield & Richter, 2021, PMC8035937) also proposes in its introduction that light stimulation evokes neural responses through photons rather than heat. In addition, at 10 mW the temperature rise was no more than 1 °C. Two studies have found that the light illumination commonly used for optogenetics increases temperature by ~2 °C [1-2]; this temperature elevation is associated with inhibition of neuronal spiking in different brain areas and cannot explain the excitation by THM observed in our experiments. We now mention this point in the main text. We also mention in the main text that the wavelength of 8.6 μm falls in the strong absorption region of water.

      References:

      1. Owen, S. F., Liu, M. H. & Kreitzer, A. C. Thermal constraints on in vivo optogenetic manipulations. Nat. Neurosci. 22, 1061–1065 (2019).

      2. Ait Ouares, K., Beurrier, C., Canepari, M., Laverne, G. & Kuczewski, N. Opto nongenetics inhibition of neuronal firing. Eur. J. Neurosci. 49, 6–26 (2019).

      In figure 1, when THM is on, there appears to be an increase in the inward current without any mechanical stimulation. There is no discussion of this, and this could be a baseline effect that is not aimed at simply enhancing existing conductances. The increase in K+ conductance seen in the voltage-gated K channel cannot account for this increased inward current, since K+ conductance is outward. THM itself could also activate a small amount of MET current, maybe via the thermal effect demonstrated by Azimzadeh et al. This increased conductance could also be from the Tmc1 leak conductance that the authors have published on previously.

      We thank the reviewer for pointing out this issue, in particular for suggesting several possible reasons about the increase in the inward current. We have now discussed this effect and cited related papers. In addition, the increase in MET currents caused by THM was far greater than the baseline offset, indicating that THM has a non-thermal effect.

      Lines 232-233: With regard to the ABR data, no data are shown on whether an OABR can be elicited. The data show that once the THM is turned on and then a click stimulus is presented, there is no response; however, this experiment does not really test whether the THM can evoke an OABR since many repetitions are required to get the ABR waveform out of the noise. If THM is on and the stimulus is below threshold, then there is unlikely going to be an evoked response since the THM stimulus is not synchronized with the ABR recording. The authors need to show that THM onset stimulation that is synchronized with the ABR recording does not result in an ABR waveform.

      We thank the reviewer for suggesting this very important experiment. Following this suggestion, we tested whether THM onset stimulation synchronized with the ABR recording can evoke an OABR. We now present the new data in Figure S5.

    1. Author Response:

      Reviewer #2 (Public Review):

      The neuronal MAP doublecortin contains two homologous DC domains, referred to as NDC and CDC. Disease-causing mutations cluster in these domains and both have been implicated in microtubule binding. However, the stoichiometry of DCX:tubulin dimers on microtubules is 1:1, suggesting only one of these domains is DCX's primary microtubule binding module. Early structural studies by Kim et al, 2001, identified different properties of NDC and CDC, despite their predicted homology. High resolution structures of both NDC and CDC have since been determined using X-ray crystallography and NMR - the domains do adopt the same overall fold, although DCX CDC structures were determined either a) bound to nanobodies (Burger et al, 2016; 5IP4) or b) forming a domain swapped dimer in a protein purified at pH 10.5 (Rufer et al, 2018; 6FNZ).

      The structures of microtubule-bound DCX have also been determined using cryo-EM - these show DCX's primary microtubule binding site is in the valley between protofilaments at the corner of four tubulin dimers. Most recently, the structures of full-length DCX at different microtubule polymerization time points have been captured at ~4A resolution (Manka & Moores, 2020). The structures of microtubule-bound CDC (6RF2) and microtubule-bound NDC (6REV) were thereby determined, but only a single DC domain at the DCX primary binding site has ever been observed.

      Thus, despite the accumulated DCX structural data, a number of significant questions remain, notably how the full-length protein binds to microtubules and what the structural origin is of the cooperative microtubule binding by DCX, which is mediated by CDC (Bechstedt and Brouhard, 2012).

      Rafiei et al. use an integrated structural modelling approach, synthesizing cross-linking mass spectrometry data of microtubule-bound DCX with existing structural information to provide new perspectives on DCX's microtubule binding mechanism. The particular strengths of this approach are that the data are both detailed and capable of capturing the heterogeneity and dynamics of the system. The incorporation of prior structural knowledge into the workflow means that these analyses sit alongside existing data, rather than being completely independent from them.

      Overall, the authors confirm findings in the literature that NDC is DCX's primary microtubule binding domain for microtubules polymerized for >30 minutes. They also find that CDC mediates microtubule-binding dependent dimerization, which could explain DCX's cooperative behavior. There are several aspects of the study that would benefit from further analysis and/or discussion to clarify potential limitations of, or assumptions in, the approaches taken:

      1) Although the authors report that the crosslinker used in their mass-spec experiments has been optimized for use with microtubules, it is not clear how general DCX binding is in this context. Specifically, how accessible are the well-buried DCX-tubulin interfaces at the primary binding site to the chemical cross-linkers on which the analysis depends? Accessibility issues could explain the results depicted in Fig. 3A, B, in which modelling that relies strictly on cross-links places NDC towards the outer edge of the protofilament, whereas inclusion of cryo-EM data in the integrated model places NDC in the inter-protofilament valley.

      There are no accessibility issues related to the crosslinks. In fact, we observe crosslinks to sites that are well buried in the cleft, as shown in the figure below (1A). This is in line with data from a previous paper on MT crosslinking (Legal et al., 2016). The appearance of the models sitting near the outer edge of the protofilament is due to how we chose to represent the system, and is an expected edge effect. It is approximately half of the actual binding site and so expected to compete. To illustrate that accessibility is not an issue, we re-clustered the models with a lower threshold (2 Å) to generate a smaller major cluster (22% of the total) in which the NDC is positioned even more deeply within the inter-protofilament valley, as shown in the figure below (1B). Clustering at a higher threshold is preferred because it represents modeling uncertainty more faithfully by including the majority of the models generated during sampling.

      Figure 1 (A) Crosslink sites on the MT lattice repeat unit highlighted in blue, showing that some are indeed buried within the interprotofilament groove. (B) Alternative representation showing the buried nature of NDC on the lattice.

      2) Based on analysis using the nanobody-bound CDC structure (5IP4), CDC appears to behave distinctly compared to NDC, such that CDC-derived cross-linking data are not consistent with the canonical inter-protofilament binding site. It would be good to know whether this depends on the particular PDB entry used. It would be important to repeat this analysis using the microtubule-bound structure of CDC (6RF2), given that this structure is conformationally distinct from PDB:5IP4.

      We calculated the RMSD between 5IP4 and 6RF2 to be 5.1 Å, and show the alignment of the structures below. This is a small difference given the precision of our integrative method, and thus would not change the results or conclusions presented in our paper. (Note that crosslinks are constrained to distances of ~25 Å or less.) We have added a statement to the text to reflect this.

      Figure 2 Structural alignment of the new MT-CDC structure (6RF2) to the one used in our study (5IP4), placed at the NDC binding site for illustration. CDC structures corresponding to 6RF2 and 5IP4 are shown in blue and cyan, respectively; alpha tubulins are shown in light grey and beta tubulins in dark grey. The RMSD calculated for residues 178-251 of 5IP4 and 6RF2 is 5.1 Å.
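      For readers unfamiliar with the calculation, an RMSD between two structures is usually computed on matched atoms after optimal rigid-body superposition. The sketch below uses the Kabsch algorithm with NumPy; it assumes two N x 3 arrays of matched coordinates (for example, Cα atoms of residues 178-251 extracted from 5IP4 and 6RF2) and does not reproduce the authors' exact atom selection or software, which the response does not specify. The function name is illustrative.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between matched coordinate sets P and Q (each N x 3) after
    optimal rigid-body superposition of P onto Q (Kabsch algorithm)."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("P and Q must both be N x 3 arrays of matched atoms")
    # Remove translation by centering each set on its centroid
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # SVD of the 3 x 3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    # Flip the last axis if needed so R is a proper rotation (det = +1)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    # Apply R to every point of P and measure the residual deviation from Q
    diff = Pc @ R.T - Qc
    return float(np.sqrt((diff ** 2).sum() / len(P)))
```

      Tools such as Biopython's `Bio.PDB.Superimposer` perform the same kind of superposition directly on parsed structures.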

      3) Building on these findings relating to DCX-microtubule interactions, further analyses focus on DCX-DCX cross-links, the formation of which is shown to be microtubule-dependent. The authors observe that >80% of DCX-DCX crosslinks involve the CDC domain and the C-terminus of the protein (C-tail), which is also consistent with NDC being the major point of microtubule interaction. However, a crucial aspect of this analysis is how readily microtubule-mediated DCX oligomerization can be discriminated from the non-specific interactions that occur due to the high local concentrations on the microtubule surface. Given the proposed primary microtubule binding role of NDC, either set of interactions would presumably involve CDC and C-tail. Additional control experiments would have been beneficial here.

      Although their data do not allow them to discriminate between different oligomerization states of DCX, the authors focus on dimer formation, and they interrogate their data based on interactions between CDC domains either i) retaining a globular fold or ii) adopting the "open" state seen in the 6FNZ domain-swapped dimer. According to the authors: "Based purely on fit of crosslinks, globular or domain-swapped modes are not distinguishable (Fig 4B). However, modelling of the main cluster shows strong similarity to the domain-swapped dimer structure"

      This is a pivotal point of the manuscript. However, the precise quantitative basis of this discrimination is not clearly described. A useful control for these experiments could also be a previously published NDC-NDC chimera (Manka & Moores, 2020), which binds microtubules at the same inter-protofilament site but which lacks the CDC domain that is potentially mediating oligomerization.

      The authors present an appealing model for CDC-mediated dimerisation of DCX on the microtubule lattice, but do not directly test its functional relevance. It will be crucial to explore the significance of dimer formation further. In the meantime, while questions concerning the mode of interaction of DCX (and its relatives) with the microtubule lattice are very much alive, the findings in the current study are not currently definitive.

      We thank the reviewer for these insights. We note that nonspecific aggregation of DCX on the MT lattice is unlikely, given the absence of aggregation at high concentration in free solution, even under induced denaturation. Further, we would expect such aggregation to be far less localized than we observe. We hope that the addition of the R303X truncation and the TIRF-based cooperativity data provides additional confidence in our claim that lattice-driven self-association is an important element of DCX function.

    1. Author Response

      The following is the authors’ response to the current reviews.

      Reviewer #1 (Public Review):

      The authors present a number of deep learning models to analyse the dynamics of epithelia. In this way they want to overcome the time-consuming manual analysis of such data and also remove a potential operator bias. Specifically, they set up models for identifying cell division events and cell division orientation. They apply these tools to the epithelium of the developing Drosophila pupal wing. They confirm a linear decrease of the division density with time and identify a burst of cell division after healing of a wound that they had induced earlier. These division events happen a characteristic time after and a characteristic distance away from the wound. These characteristic quantities depend on the size of the wound.

      Strengths:

      The methods developed in this work achieve the goals set by the authors and are a very helpful addition to the toolbox of developmental biologists. They could potentially be used on various developing epithelia. The evidence for the impact of wounds on cell division is compelling.

      The methods presented in this work should prove to be very helpful for quantifying cell proliferation in epithelial tissues.

      We thank the reviewer for the positive comments!

      Reviewer #2 (Public Review):

      In this manuscript, the authors propose a computational method based on deep convolutional neural networks (CNNs) to automatically detect cell divisions in two-dimensional fluorescence microscopy timelapse images. Three deep learning models are proposed to detect the timing of division, predict the division axis, and enhance cell boundary images to segment cells before and after division. Using this computational pipeline, the authors analyze the dynamics of cell divisions in the epithelium of the Drosophila pupal wing and find that a wound first induces a reduction in the frequency of division followed by a synchronised burst of cell divisions about 100 minutes after its induction.

      Comments on revised version:

      Regarding Reviewer 1's comment on the architecture details, I now understand that the precise architecture (number/type of layers, activation functions, pooling operations, skip connections, upsampling choice...) might have remained relatively hidden to the authors themselves, as the U-Net is built automatically by the fast.ai library from a given classical choice of encoder architecture (ResNet34 and ResNet101 here), which generates the decoder part and the skip connections.

      Regarding Major point 1, I raised the question of the generalisation potential of the method. I do not think, for instance, that the optimal number of frames to use, or the optimal choice of their time-shift with respect to the division time (t-n, t+m) (not systematically studied here), are generic hyperparameters that can be directly transferred to another setting. This implies that the method proposed will necessarily require re-labeling, re-training and re-optimizing the hyperparameters, which directly influence the network architecture, for each new dataset imaged differently. This limits the generalisation of the method to other datasets, and this may be seen as contrasting with other tools developed in the field for other tasks, such as Cellpose for segmentation, which has shown real potential for generalisation across various data modalities. I was hoping that the authors would test the robustness of their method themselves, for instance by re-imaging the same tissue with a slightly different acquisition rate, to give more weight to their work.

      We thank the referee for the comments. Regarding this particular biological system, due to photobleaching over long imaging periods (and the availability of imaging systems during the project), we would have difficulty imaging at much higher rates than the 2-minute time frame we currently use. These limitations are true for many such systems, and it is rarely possible to image rapidly for long periods of time in real experiments. Given this upper limit in frame rate, we could, in principle, sample these data at a lower frame rate by removing time points from the videos, but this typically leads to worse results. With some pilot data, we have tried using fewer time intervals for our analysis, but they always gave worse results. We found that we need to feed the maximum amount of information available into the model to get the best results (i.e. the fastest frame rate possible, given the data available). Our goal is to teach the neural net to identify dynamic, space-time localised events from time-lapse videos, in which the duration of an event is a key parameter. Our division events take 10 minutes or less to complete; therefore, we used 5 timepoints in the videos for the deep learning model. If we considered another system with dynamic events of duration T, we would use T/t timepoints, where t is the minimum time interval (for our data, t = 2 min). For example, if we could image every minute, we would use 10 timepoints. As discussed below, we envision that other users with different imaging setups and requirements may need to retrain the model for their own data; to help with this, we have now provided more detailed instructions on how to do this (see later).
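
      The T/t rule of thumb above can be written as a tiny helper. This is a minimal sketch: the function name is hypothetical, and rounding up for non-integer ratios is our assumption rather than something stated in the response.

```python
import math


def frames_needed(event_duration_min: float, frame_interval_min: float) -> int:
    """Number of consecutive frames needed to span an event of the given duration.

    Implements the T / t rule from the text, rounded up (an assumption) so a
    non-integer ratio never under-covers the event.
    """
    if frame_interval_min <= 0:
        raise ValueError("frame interval must be positive")
    return math.ceil(event_duration_min / frame_interval_min)


print(frames_needed(10, 2))  # 5: the 2-minute imaging used here
print(frames_needed(10, 1))  # 10: imaging every minute
```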

      In this regard, and because the authors claimed to provide clear instructions on how to reuse their method or adapt it to a different context, I delved deeper into the code and, to my surprise, found that it falls well short of the coding practice expected of a well-documented and accessible tool.

      To start with, one has to be relatively accustomed to Napari to understand how the plugin must be installed, as the only thing given is a pip install command (which could be typed in any terminal without installing the plugin for Napari, but has to be typed inside the Napari terminal, which is mentioned nowhere). Surprisingly, the plugin was not uploaded to the napari hub or to PyPI by the authors, so it is not directly searchable/findable; one has to go to the GitHub repository and install it manually. In that regard, no description was provided in the copy-pasted template files associated with the napari hub, so exporting it to the hub would actually leave it undocumented.

      We thank the referee for suggesting DeXtrusion (Villars et al. 2023) as an example. We have endeavoured to produce similarly detailed documentation for our tools. We now have clear instructions for installation requiring only minimal coding knowledge, and we have provided a user manual for the napari plug-in. This includes information on each of the options for using the model and the outputs they will produce. The plugin has been tested by several colleagues using both Windows and Mac operating systems.

      Author response image 1.

      Regarding the Python notebooks, one can fairly say that the "clear instructions" that were supposed to illuminate the code are really minimal. Only one notebook, "trainingUNetCellDivision10.ipynb", actually has some comments; the others have (almost) none, nor titles to help the unskilled programmer delving into the script to guess what it should do. I doubt that a biologist without a strong computational background will manage to adapt the method to their own dataset (which seems to me unavoidable for the reasons mentioned above).

      Within the README file, we have now included information on how to retrain the models with helpful links to deep learning tutorials (which, indeed, some of us have learnt from) for those new to deep learning. All Jupyter notebooks now include more comments explaining the models.

      Finally, regarding the data, none is shared publicly along with this manuscript/code, so if one doesn't have a similar type of dataset (which must first be annotated in a similar manner), one cannot even test the networks/plugin for one's own information. A common and necessary practice in the field (and possibly a longer-lasting contribution of this work) would have been to provide the complete, annotated dataset that was used to train and test the artificial neural network. The basic reason is that a more performant, or more generalisable, deep-learning model may be developed very soon after this one, and for its performance to be fairly compared, it must be evaluated on the same dataset. Benchmarking and comparison of method performance are at the core of computer vision and deep learning.

      We thank the referee for these comments. We have now uploaded all the data used to train the models and to test them, as well as all the data used in the analyses for the paper. This includes many videos that were not used for training but were analysed to generate the paper’s results. The link to these data sets is provided on our GitHub page (https://github.com/turleyjm/cell-division-dl-plugin/tree/main). In the folder for the data sets and in the GitHub repository, we have included the Jupyter notebooks used to train the models and these can be used for retraining. We have made our data publicly available in the Zenodo dataset https://zenodo.org/records/10846684 (added to last paragraph of discussion). We have also included scripts that can be used to compare the model output with ground truth, including outputs highlighting false positives and false negatives. Together with these scripts, models can be compared and contrasted, both in general and in individual videos. Overall, we very much appreciate the reviewer’s advice, which has made the plugin much more user-friendly and, hopefully, easier for other groups to train their own models. Our contact details are provided, and we would be happy to advise any groups that would like to use our tools.
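
      As an illustration of how model output might be compared with hand-labelled ground truth, the sketch below greedily pairs each predicted division with the nearest unmatched ground-truth division within a spatial and temporal tolerance, then counts true positives, false positives and false negatives. The function name, the (t, x, y) event format and the tolerance values are hypothetical and are not taken from the authors' released scripts; greedy matching is also a simplification of an optimal one-to-one assignment.

```python
import math


def match_detections(predicted, ground_truth, max_dist_px=10.0, max_dt_frames=1):
    """Greedily match predicted divisions to ground-truth divisions.

    Each division is a (t, x, y) tuple. A prediction matches the closest
    still-unmatched ground-truth division within max_dist_px in space and
    max_dt_frames in time. Returns (true_pos, false_pos, false_neg).
    """
    unmatched = list(ground_truth)
    true_pos = 0
    for t, x, y in predicted:
        best_idx, best_dist = None, math.inf
        for i, (gt_t, gx, gy) in enumerate(unmatched):
            dist = math.hypot(x - gx, y - gy)
            if abs(t - gt_t) <= max_dt_frames and dist <= max_dist_px and dist < best_dist:
                best_idx, best_dist = i, dist
        if best_idx is not None:
            unmatched.pop(best_idx)  # each ground-truth event is matched at most once
            true_pos += 1
    false_pos = len(predicted) - true_pos
    false_neg = len(unmatched)
    return true_pos, false_pos, false_neg


pred = [(5, 100, 100), (5, 300, 300), (12, 50, 60)]
gt = [(5, 102, 98), (13, 52, 61), (20, 400, 400)]
print(match_detections(pred, gt))  # (2, 1, 1)
```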


      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      The authors present a number of deep-learning models to analyse the dynamics of epithelia. In this way, they want to overcome the time-consuming manual analysis of such data and also remove a potential operator bias. Specifically, they set up models for identifying cell division events and cell division orientation. They apply these tools to the epithelium of the developing Drosophila pupal wing. They confirm a linear decrease of the division density with time and identify a burst of cell division after the healing of a wound that they had induced earlier. These division events happen a characteristic time after and a characteristic distance away from the wound. These characteristic quantities depend on the size of the wound.

      Strength:

      The methods developed in this work achieve the goals set by the authors and are a very helpful addition to the toolbox of developmental biologists. They could potentially be used on various developing epithelia. The evidence for the impact of wounds on cell division is solid.

      Weakness:

      Some aspects of the deep-learning models remained unclear, and the authors might want to think about adding details. First of all, for readers not familiar with deep-learning models, I would like to see more information about ResNet and U-Net, which are at the base of the new deep-learning models developed here. What is the structure of these networks?

      We agree with the Reviewer and have included additional information on page 8 of the manuscript, outlining some background information about the architecture of ResNet and U-Net models.

      How many parameters do you use?

      We apologise for this omission and have now included the number of parameters and layers in each model in the methods section on page 25.
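
      To give a sense of scale for such parameter counts, a 2D convolution with square k × k kernels has k·k·C_in·C_out weights plus C_out biases. The sketch below applies this to the first convolution of a standard ResNet (64 filters of 7 × 7, no bias) and shows how that layer grows when the input goes from 3 RGB channels to 10 channels. It is an illustrative calculation with a hypothetical helper name, not the parameter table reported in the methods.

```python
def conv2d_params(in_ch: int, out_ch: int, kernel: int, bias: bool = True) -> int:
    """Parameters of a square 2D convolution: k*k*C_in*C_out weights (+ C_out biases)."""
    return kernel * kernel * in_ch * out_ch + (out_ch if bias else 0)


# First layer of a standard ResNet: 64 filters of size 7x7, no bias term.
rgb_stem = conv2d_params(3, 64, 7, bias=False)
ten_channel_stem = conv2d_params(10, 64, 7, bias=False)
print(rgb_stem, ten_channel_stem, ten_channel_stem - rgb_stem)  # 9408 31360 21952
```

      In a plain ResNet encoder only this first convolution depends on the number of input channels, so widening the input adds about 22 thousand parameters to a network that, in its standard ImageNet form (ResNet34), has about 21.8 million. A fast.ai U-Net may contain further input-dependent layers in its decoder, which this sketch does not model.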

      What is the difference between validating and testing the model? Do the corresponding data sets differ fundamentally?

      The difference between ‘validating’ and ‘testing’ the model is that validating data is used during training to determine whether the model is overfitting. If the model is performing well on the training data but not on the validating data, this is a key signal that the model is overfitting, and changes will need to be made to the network/training method to prevent this. The testing data is used after all the training has been completed, to test the performance of the model on fresh data it has not been trained on. We have removed reference to the validating data in the main text to make it simpler and added this explanation to the methods. There is no fundamental (or experimental) difference between each of the labelled data sets; rather, they are collected from different biological samples. We have now included this information in the Methods text on page 24.
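
      Because the labelled sets differ only in which biological sample they come from, one way to keep them independent is to assign whole samples, rather than individual clips, to each split. A minimal sketch follows; the function name, the split fractions and the clip-to-sample mapping are hypothetical.

```python
import random


def split_by_sample(clip_sample_ids, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign whole biological samples to train/validation/test splits.

    clip_sample_ids maps each labelled clip to the sample it came from. Every
    clip from a given sample lands in the same split, so the test set only
    contains samples the model never saw during training.
    """
    samples = sorted(set(clip_sample_ids.values()))
    random.Random(seed).shuffle(samples)
    n_train = round(fractions[0] * len(samples))
    n_val = round(fractions[1] * len(samples))
    groups = {
        "train": set(samples[:n_train]),
        "validation": set(samples[n_train:n_train + n_val]),
        "test": set(samples[n_train + n_val:]),
    }
    return {name: sorted(clip for clip, s in clip_sample_ids.items() if s in ids)
            for name, ids in groups.items()}


clips = {f"clip{i}_{j}": f"sample{i}" for i in range(20) for j in range(3)}
splits = split_by_sample(clips)
print({name: len(v) for name, v in splits.items()})
```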

      How did you assess the quality of the training data classification?

      These data were generated and hand-labelled by an expert with many years of experience in identifying cell divisions in imaging data, to give the ground truth for the deep learning model.

      Reviewer #1 (Recommendations For The Authors):

      You repeatedly use 'new', 'novel' as well as 'surprising' and 'unexpected'. The latter are rather subjective and it is not clear based on what prior knowledge you make these statements. Unless indicated otherwise, it is understood that the results and methods are new, so you can delete these terms.

      We have deleted these words, as suggested, for almost all cases.

      p.4 "as expected" add a reference or explain why it is expected.

      A reference has now been included in this section, as suggested.

      p.4 "cell divisions decrease linearly with time" Only later (p.10) it turns out that you think about the density of cell divisions.

      This has been changed to "cell division density decreases linearly with time".

      p.5 "imagine is largely in one plane" while below "we generated a 3D z-stack" and above "our in vivo 3D image data" (p.4). Although these statements are not strictly contradictory, I still find them confusing. Eventually, you analyse a 2D image, so I would suggest that you refer to your in vivo data as being 2D.

      We apologise for the confusion here; the imaging data was initially generated using 3D z-stacks but this 3D data is later converted to a 2D focused image, on which the deep learning analysis is performed. We are now more careful with the language in the text.
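
      The response does not state how the 3D z-stacks are reduced to a single focused 2D image. As one simple, commonly used option, a maximum-intensity projection keeps the brightest value along z for each pixel. The sketch below assumes a (z, y, x) NumPy array and is not necessarily the authors' focusing method.

```python
import numpy as np


def max_projection(z_stack: np.ndarray) -> np.ndarray:
    """Collapse a (z, y, x) stack to a 2D (y, x) image using the brightest value along z."""
    if z_stack.ndim != 3:
        raise ValueError("expected a (z, y, x) array")
    return z_stack.max(axis=0)


stack = np.random.default_rng(0).random((12, 64, 64))  # hypothetical 12-slice stack
image_2d = max_projection(stack)
print(image_2d.shape)  # (64, 64)
```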

      p.7 "We have overcome (...) the standard U-Net model" This paragraph remains rather cryptic to me. Maybe you can explain in two sentences what a U-Net is or state its main characteristics. Is it important to state which class you have used at this point? Similarly, what is the exact role of the ResNet model? What are its characteristics?

      We have included more details on both the ResNet and U-Net models and how our model incorporates properties from them on Page 8.

      p.8 Table 1 Where do I find it? Similarly, I could not find Table 2.

      These were originally located in the supplemental information document, but have been moved to the main manuscript.

      p.9 "developing tissue in normal homeostatic conditions" Aren't homeostatic and developing contradictory? In one case you maintain a state, in the other, it changes.

      We agree with the Reviewer and have removed the word ‘homeostatic’.

      p.9 "Develop additional models" I think 'models' refers to deep learning models, not to physical models of epithelial tissue development. Maybe you can clarify this?

      Yes, this is correct; we have phrased this better in the text.

      p.12 "median error" median difference to the manually acquired data?

      Yes, and we have made this clearer in the text, too.
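
      Because a division axis has no direction (0° and 180° describe the same axis), a difference between predicted and manually measured orientations should wrap at 180°. The sketch below computes such an axial difference and its median over paired measurements; it assumes angles in degrees, and the function names are hypothetical.

```python
import statistics


def axial_difference(a_deg: float, b_deg: float) -> float:
    """Smallest angle between two undirected axes, in degrees (0 to 90).

    For example, 10 and 170 degrees describe axes only 20 degrees apart.
    """
    d = abs(a_deg - b_deg) % 180.0
    return min(d, 180.0 - d)


def median_axial_error(predicted, manual):
    """Median axial difference between paired predicted and manual orientations."""
    return statistics.median(axial_difference(p, m) for p, m in zip(predicted, manual))


print(axial_difference(10, 170))  # 20.0
print(median_axial_error([10, 95, 45], [170, 90, 60]))  # 15.0
```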

      p.12 "we expected to observe a bias of division orientation along this axis" Can you justify the expectation? Elongated cells are not necessarily aligned with the direction of a uniaxially applied stress.

      Although this is not always the case, we have now included additional references to previous work from other groups which demonstrated that wing epithelial cells do become elongated along the P/D axis in response to tension.

      p.14 "a rather random orientation" Please, quantify.

      The division orientations are quantified in Fig. 4F,G; we have now changed our description from ‘random’ to ‘unbiased’.

      p.17 "The theories that must be developed will be statistical mechanical (stochastic) in nature" I do not understand. Statistical mechanics refers to systems at thermodynamic equilibrium, stochastic to processes that depend on, well, stochastic input.

      We have clarified that we are referring to non-equilibrium statistical mechanics (the study of macroscopic systems far from equilibrium, a rich field of research with many open problems and applications in biology).

      Reviewer #2 (Public Review):

      In this manuscript, the authors propose a computational method based on deep convolutional neural networks (CNNs) to automatically detect cell divisions in two-dimensional fluorescence microscopy timelapse images. Three deep learning models are proposed to detect the timing of division, predict the division axis, and enhance cell boundary images to segment cells before and after division. Using this computational pipeline, the authors analyze the dynamics of cell divisions in the epithelium of the Drosophila pupal wing and find that a wound first induces a reduction in the frequency of division followed by a synchronised burst of cell divisions about 100 minutes after its induction.

      In general, novelty over previous work does not seem particularly important. From a methodological point of view, the models are based on generic architectures of convolutional neural networks, with minimal changes, and on ideas already explored in general. The authors seem to have missed much (most?) of the literature on the specific topic of detecting mitotic events in 2D timelapse images, which has been published in more specialized journals or proceedings (TPAMI, CVPR, etc.; see references below). Even though the image modality or biological structure may be different (sometimes non-fluorescent images), I don't believe it makes a big difference. How the authors' approach compares to this previously published work is not discussed, which prevents me from objectively assessing the true contribution of this article from a methodological perspective.

      On the contrary, some competing works have proposed methods based on newer, and generally more efficient, architectures specifically designed to model temporal sequences (Phan 2018, Kitrungrotsakul 2019, 2021, Mao 2019, Shi 2020). These natural candidates (recurrent networks, long short-term memory (LSTM), gated recurrent units (GRU), or even more recently transformers), coupled to CNNs, are not even mentioned in the manuscript, although they have proved their generic superiority for inference tasks involving time series (Major point 2). Even though the original idea/trick of exploiting the different channels of RGB images to address the temporal aspect might seem smart at first (as it reduces the task of changing/testing a new architecture to a minimum), I suspect that CNNs trained this way may not generalize very well to videos where the temporal resolution is changed slightly (Major point 1). This could be quite problematic, as each new dataset acquired with a different temporal resolution or temperature may require manual relabeling and retraining of the network. In this perspective, recent alternatives (Phan 2018, Gilad 2019) have proposed unsupervised approaches, which could largely reduce the need for manual labeling of datasets.

      We thank the reviewer for their constructive comments. Our goal is to develop a cell detection method that has a very high accuracy, which is critical for practical and effective application to biological problems. The algorithms need to be robust enough to cope with the difficult experimental systems we are interested in studying, which involve densely packed epithelial cells within in vivo tissues that are continuously developing, as well as repairing. In response to the above comments of the reviewer, we apologise for not including these important papers from the division detection and deep learning literature, which are now discussed in the Introduction (on page 4).

      A key novelty of our approach is the use of multiple fluorescent channels to increase information for the model. As the referee points out, our method benefits from using and adapting existing highly effective architectures. Hence, we have been able to incorporate deeper models than some others have previously used. An additional novelty is using this same model architecture (retrained) to detect cell division orientation. For future practical use by us and other biologists, the models can easily be adapted and retrained to suit experimental conditions, including different multiple fluorescent channels or numbers of time points. Unsupervised approaches are very appealing due to the potential time saved compared to manual hand labelling of data. However, the accuracy of unsupervised models is currently much lower than that of supervised models (as shown in Phan 2018) and, most importantly, well below the levels needed for practical use in analysing inherently variable (and challenging) in vivo experimental data.

      Regarding the other convolutional neural networks described in the manuscript:

      (1) The one proposed to predict the orientation of mitosis performs a regression task, predicting a probability for the division angle. The architecture, which must be different from a simple Unet, is not detailed anywhere, so the way it was designed is difficult to assess. It is unclear if it also performs mitosis detection, or if it is instead used to infer orientation once the timing and location of the division have been inferred by the previous network.

      The neural network used for U-NetOrientation has the same architecture as U-NetCellDivision10 but has been retrained to complete a different task: finding division orientation. Our workflow is as follows: firstly, U-NetCellDivision10 is used to find cell divisions; secondly, U-NetOrientation is applied locally to determine the division orientation. These points have now been clarified in the main text on Page 14.
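
      The two-stage workflow described above can be sketched as follows: find local maxima in a per-pixel division probability map, crop a window around each one, and pass the crop to an orientation estimator. Here `orientation_model` is a hypothetical stand-in for U-NetOrientation, and the threshold, neighbourhood radius and crop size are illustrative values, not ones taken from the paper.

```python
import numpy as np


def find_divisions(prob_map: np.ndarray, threshold: float = 0.5, radius: int = 3):
    """Return (y, x) positions that are local maxima above threshold in a 2D map."""
    peaks = []
    h, w = prob_map.shape
    for y in range(h):
        for x in range(w):
            value = prob_map[y, x]
            if value < threshold:
                continue
            window = prob_map[max(0, y - radius):y + radius + 1,
                              max(0, x - radius):x + radius + 1]
            if value == window.max():  # flat plateaus can yield several adjacent peaks
                peaks.append((y, x))
    return peaks


def crop(frames: np.ndarray, y: int, x: int, half: int = 16) -> np.ndarray:
    """Cut a (2*half, 2*half) window around (y, x) from a (channels, y, x) stack, zero-padding edges."""
    padded = np.pad(frames, ((0, 0), (half, half), (half, half)))
    return padded[:, y:y + 2 * half, x:x + 2 * half]


def detect_then_orient(frames, prob_map, orientation_model, threshold=0.5):
    """Stage 1 locates divisions; stage 2 runs the orientation model on a local crop of each."""
    return [((y, x), orientation_model(crop(frames, y, x)))
            for y, x in find_divisions(prob_map, threshold)]


def stub_orientation_model(patch):
    """Hypothetical placeholder for U-NetOrientation: just reports the crop shape."""
    return patch.shape


prob = np.zeros((40, 40))
prob[10, 12], prob[11, 12], prob[30, 25] = 0.9, 0.7, 0.8
frames = np.zeros((10, 40, 40))
print(detect_then_orient(frames, prob, stub_orientation_model))
# [((10, 12), (10, 32, 32)), ((30, 25), (10, 32, 32))]
```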

      (2) The one proposed to improve the quality of cell boundary images before segmentation is nothing new, it has now become a classic step in segmentation, see for example Wolny et al. eLife 2020.

      We have cited similar segmentation models in our paper and thank the referee for this additional one. We made an improvement to the segmentation models by using GFP-tagged E-cadherin, a protein localised in a thin layer at the apical boundary of cells. So, while this is primarily a 2D segmentation problem, some additional information is available in the z-axis, as the protein is visible in 2-3 separate z-slices. Hence, we supplied this 3-focal-plane input to take advantage of the 3D nature of this signal. This approach has been made more explicit in the text (Pages 14, 15) and Figure (Fig. 2D).
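
      One way to build such a three-focal-plane input is to locate the z-slice where the E-cadherin signal is strongest and take it together with its two neighbours as three input channels. The sketch below assumes a (z, y, x) array and uses mean slice intensity to pick the plane; both the selection rule and the function name are our assumptions, not the authors' documented procedure.

```python
import numpy as np


def three_plane_input(ecad_stack: np.ndarray) -> np.ndarray:
    """Return the brightest z-slice of a (z, y, x) stack plus its two neighbours, as (3, y, x)."""
    n_slices = ecad_stack.shape[0]
    if n_slices < 3:
        raise ValueError("need at least three z-slices")
    best = int(np.argmax(ecad_stack.mean(axis=(1, 2))))
    start = min(max(best - 1, 0), n_slices - 3)  # shift the window to stay inside the stack
    return ecad_stack[start:start + 3]


demo = np.zeros((8, 4, 4))
demo[4] += 1
demo[5] += 2  # hypothetical brightest plane
demo[6] += 1
print(three_plane_input(demo).shape)  # (3, 4, 4)
```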

      As a side note, I found it a bit frustrating to realise that all the analysis was done in 2D while the original images are 3D z-stacks, so a lot of the 3D information had to be compressed and has not been used. A novelty, in my opinion, could have resided in the generalisation to 3D of the deep-learning approaches previously proposed in that context, which are exclusively 2D, in particular, to predict the orientation of the division.

      Our experimental system is a relatively flat 2D tissue with the orientation of the cell divisions consistently in the xy-plane. Hence, a 2D analysis is most appropriate for this system. With the successful application of the 2D methods already achieving high accuracy, we envision that extension to 3D would only offer a slight increase in effectiveness as these measurements have little room for improvement. Therefore, we did not extend the method to 3D here. However, of course, this is the next natural step in our research as 3D models would be essential for studying 3D tissues; such 3D models will be computationally more expensive to analyse and more challenging to hand label.

      Concerning the biological application of the proposed methods, I found the results interesting, showing the potential of such a method to automatise mitosis quantification for a particular biological question of interest, here wound healing. However, the deep learning methods/applications that are put forward as the central point of the manuscript are not particularly original.

      We thank the referee for their constructive comments. Our aim was not only to show the accuracy of our models but also to show how they might be useful to biologists for automated analysis of large datasets, which is a major, if not the main, bottleneck for many imaging experiments. The ability to process large datasets will improve the robustness of results, as well as allow additional hypotheses to be tested. Our study also demonstrated that these models can cope with real in vivo experiments where additional complications such as progressive development, tissue wounding and inflammation must be accounted for.

      Major point 1: generalisation potential of the proposed method.

      The neural network model proposed for mitosis detection relies on a 2D convolutional neural network (CNN), more specifically on the Unet architecture, which has become widespread for the analysis of biology and medical images. The strategy proposed here exploits the fact that the input of such an architecture is natively composed of several channels (originally 3 to handle the 3 RGB channels, which is actually a holdover from computer vision, since most medical/biological images are gray images with a single channel), to directly feed the network with 3 successive images of a timelapse at a time. This idea is, in itself, interesting because no modification of the original architecture had to be carried out. The latest 10-channel model (U-NetCellDivision10), which includes more channels for better performance, required minimal modification to the original U-Net architecture but also simultaneous imaging of cadherin in addition to histone markers, which may not be a generic solution.
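
      The frames-as-channels idea amounts to reshaping a short movie into a multi-channel image. The sketch below assumes, for illustration, that the 10 channels of U-NetCellDivision10 are 5 timepoints of each of two markers (histone and E-cadherin), consistent with the 5-timepoint window described earlier. The interleaved channel order and the function name are hypothetical; any order works as long as training and inference use the same one.

```python
import numpy as np


def stack_as_channels(histone: np.ndarray, cadherin: np.ndarray) -> np.ndarray:
    """Interleave two (t, y, x) movies into a single (2*t, y, x) network input.

    Channel order is h0, c0, h1, c1, ... (an illustrative choice).
    """
    if histone.shape != cadherin.shape:
        raise ValueError("both markers must have the same (t, y, x) shape")
    # np.stack gives (t, 2, y, x); reshape merges time and marker into one channel axis
    return np.stack([histone, cadherin], axis=1).reshape(-1, *histone.shape[1:])


rng = np.random.default_rng(0)
histone = rng.random((5, 64, 64))   # 5 timepoints, hypothetical 64x64 crop
cadherin = rng.random((5, 64, 64))
x = stack_as_channels(histone, cadherin)
print(x.shape)  # (10, 64, 64)
```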

      We believe we have provided a general approach for practical use by biologists that can be applied to a range of experimental data, whether that is based on varying numbers of fluorescent channels and/or timepoints. We envisioned that experimental biologists are likely to have several different parameters permissible for measurement based on their specific experimental conditions e.g., different fluorescently labelled proteins (e.g. tubulin) and/or time frames. To accommodate this, we have made it easy and clear in the code on GitHub how these changes can be made. While the model may need some alterations and retraining, the method itself is a generic solution as the same principles apply to very widely used fluorescent imaging techniques.

      Since CNN-based methods accept only fixed-size vectors (fixed image size and fixed channel number) as input (and output), the length or time resolution of the extracted sequences should not vary from one experiment to another. As such, the method proposed here may lack generalization capabilities, as it would have to be retrained for each experiment with a slightly different temporal resolution. The paper should have compared results with slightly different temporal resolutions to assess its inference robustness toward fluctuations in division speed.

      If multiple temporal resolutions are required for a set of experiments, we envision that the model could be trained over a range of these different temporal resolutions. Of course, the temporal resolution that requires the largest input vector would set the model's fixed number of input channels. Given the depth of the models used, and the potential to increase this easily by replacing resnet34 with resnet50 or resnet101, the model would likely be able to cope with this, although we have not specifically tested this (page 27).
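
      One way to realise a single model spanning several temporal resolutions is to fix the channel count by the most demanding setting and zero-pad shorter clips. This is our assumption about how shorter inputs could be fitted in, not a procedure tested in the paper; temporal resampling would be an alternative. The example numbers reuse the 10-minute division window discussed earlier, and the function name is hypothetical.

```python
import numpy as np


def pad_to_channels(clip: np.ndarray, n_channels: int) -> np.ndarray:
    """Zero-pad a (c, y, x) clip along the channel axis up to n_channels."""
    c = clip.shape[0]
    if c > n_channels:
        raise ValueError("clip has more channels than the model accepts")
    pad = np.zeros((n_channels - c, *clip.shape[1:]), dtype=clip.dtype)
    return np.concatenate([clip, pad], axis=0)


# Hypothetical settings: frame interval (min) -> frames needed for a 10-min division.
frames_per_interval = {2: 5, 1: 10}
n_channels = max(frames_per_interval.values())  # set by the finest time resolution

clip_2min = np.ones((5, 32, 32), dtype=np.float32)
print(pad_to_channels(clip_2min, n_channels).shape)  # (10, 32, 32)
```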

      Another approach (not discussed) consists of directly convolving several temporal frames using a 3D CNN (2D+time) instead of a 2D CNN, in order to detect a temporal event. Such an idea shares some similarities with the proposed approach, although in this previous work (Ji et al. TPAMI 2012 and, for division detection, Nie et al. CVPR Workshops 2016) convolution is performed spatio-temporally, which may present advantages. How does the authors' method compare to such an (also very simple) approach?

      We thank the Reviewer for this insightful comment. The text now discusses this (on Pages 8 and 17). Key differences between the models include our incorporation of multiple fluorescent channels and the use of much deeper models. We suggest that our method allows for an easy and natural extension to use deeper models for even more demanding tasks, e.g. distinguishing between healthy and defective divisions. We also tested our method under ‘difficult conditions’, such as when a wound is present; despite the challenges imposed by the wound (including the discussed reduction in fluorescent intensities near the wound edge), we achieved higher accuracy than Nie et al., who used a low-density in vitro system (their accuracy of 78.5% compared with our F1 score of 0.964).
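
      For reference, the F1 score quoted here combines precision P and recall R, computed from true positives (TP), false positives (FP) and false negatives (FN); unlike accuracy, it does not count true negatives.

$$
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2PR}{P + R} = \frac{2\,TP}{2\,TP + FP + FN}
$$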

      Major point 2: innovatory nature of the proposed method.

      The authors' idea of exploiting existing channels in the input vector to feed successive frames is interesting, but the natural choice in deep learning for manipulating time series is to use recurrent networks or their newer and more stable variants (LSTM, GRU, attention networks, or transformers). Several papers exploiting such approaches have been proposed for the mitotic division detection task, but they are not mentioned or discussed in this manuscript: Phan et al. 2018, Mao et al. 2019, Kitrungrotsakul et al. 2019, Shi et al. 2020.

      An obvious advantage of an LSTM architecture combined with CNN is that it is able to address variable length inputs, therefore time sequences of different lengths, whereas a CNN alone can only be fed with an input of fixed size.

      LSTM architectures may produce similar accuracy to the models we employ in our study; however, given the high degree of accuracy we already achieve with our methods, it is hard to see how they would improve the understanding of the biology of wound healing that we have uncovered. Hence, they may provide an alternative way to achieve similar results from analyses of our data. It would also be interesting to see how LSTM architectures would cope with the noisy and difficult wounded data that we have analysed. We agree with the referee that these alternative models could more easily accommodate differences in temporal resolution and division duration (see discussion on Page 20). Nevertheless, we imagine that, after selecting a sufficiently large time/fluorescent channel input, biologists could likely train our model to cope with a range of division lengths.

      Another advantage of some of these approaches is that they rely on unsupervised learning, which can avoid the tedious relabeling of data (Phan et al. 2018, Gilad et al. 2019).

      While these are very interesting ideas, we believe these unsupervised methods would struggle under the challenging conditions of our own and others' experimental imaging data. The epithelial tissue examined in the present study has a particularly high density of cells with overlapping nuclei compared with the other experimental systems on which these unsupervised methods have been tested. Another potential problem with these unsupervised methods is the difficulty of distinguishing dynamic debris and immune cells from mitotic cells. Once again, despite our experimental data being more complex and difficult, our methods perform better than other methods designed for simpler systems, as in Phan et al. 2018 and Gilad et al. 2019; for example, analysis performed on lower-density in vitro and unwounded tissues gave best single-video F1 scores of 0.768 and 0.829 for unsupervised and supervised methods, respectively (Phan et al. 2018). We envision that an F1 score above 0.9 (and preferably above 0.95) would be crucial for practical use by biologists; hence, we believe supervision is currently still required. We expect that retraining our models for use in other experimental contexts will require smaller hand-labelled datasets, as they will be able to take advantage of transfer learning (see discussion on Page 4).

      References:

      We have included these additional references in the revised version of our Manuscript.

      Ji, S., Xu, W., Yang, M., & Yu, K. (2012). 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 221-231.

      Nie, W. Z., Li, W. H., Liu, A. A., Hao, T., & Su, Y. T. (2016). 3D convolutional networks-based mitotic event detection in time-lapse phase contrast microscopy image sequences of stem cell populations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 55-62).

      Phan, H. T. H., Kumar, A., Feng, D., Fulham, M., & Kim, J. (2018). Unsupervised two-path neural network for cell event detection and classification using spatiotemporal patterns. IEEE Transactions on Medical Imaging, 38(6), 1477-1487.

      Gilad, T., Reyes, J., Chen, J. Y., Lahav, G., & Riklin Raviv, T. (2019). Fully unsupervised symmetry-based mitosis detection in time-lapse cell microscopy. Bioinformatics, 35(15), 2644-2653.

      Mao, Y., Han, L., & Yin, Z. (2019). Cell mitosis event analysis in phase contrast microscopy images using deep learning. Medical Image Analysis, 57, 32-43.

      Kitrungrotsakul, T., Han, X. H., Iwamoto, Y., Takemoto, S., Yokota, H., Ipponjima, S., ... & Chen, Y. W. (2019). A cascade of 2.5D CNN and bidirectional CLSTM network for mitotic cell detection in 4D microscopy image. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 18(2), 396-404.

      Shi, J., Xin, Y., Xu, B., Lu, M., & Cong, J. (2020, November). A Deep Framework for Cell Mitosis Detection in Microscopy Images. In 2020 16th International Conference on Computational Intelligence and Security (CIS) (pp. 100-103). IEEE.

      Wolny, A., Cerrone, L., Vijayan, A., Tofanelli, R., Barro, A. V., Louveaux, M., ... & Kreshuk, A. (2020). Accurate and versatile 3D segmentation of plant tissues at cellular resolution. eLife, 9, e57613.

    1. Author Response

      Reviewer #1 (Public Review):

      Castelán-Sánchez et al. analyzed SARS-CoV-2 genomes from Mexico collected between February 2020 and November 2021. This period spans three major spikes in daily COVID-19 cases in Mexico and the rise of three distinct variants of concern (VOCs; B.1.1.7, P.1, and B.1.617.2). The authors perform careful phylogenetic analyses of these three VOCs, as well as of two other lineages that rose to substantial frequency in Mexico, focusing on identifying periods of cryptic transmission (before the lineage was first detected) and introductions to and from the neighboring United States. The figures are well presented and described, and the results add to our understanding of SARS-CoV-2 in Mexico. However, I have some concerns and questions about sampling that could affect the results and conclusions. The authors do not provide any details on the distribution of samples across the various Mexican States, making it hard to evaluate several key conclusions. Although this information is provided in Supplementary Data 2, it is not presented in a way that enables the reader to evaluate whether lineages were truly predominant in certain regions of the country, or whether these results are attributable purely to sampling bias. Specifically, each lineage is said to be dominant in a particular state or region, but it was not clear to me whether sampling across states was even at all time points. For example, the authors state that most B.1.1.7 genome sampling is from the state of Chihuahua, but it is not clear whether this was due to more sequenced samples from that region during the time that B.1.1.7 was circulating, or whether the effects of B.1.1.7 were truly different across the country. The authors do mention sequencing biases several times, but need to be more specific about the nature of this bias and how it could affect their conclusions.

      It is surprising to see in this manuscript that the B.1.1.7 lineage did not rise above 25% prevalence in the data presented, despite its rapid rise in prevalence in many other parts of the world. This calls into question whether the presented frequencies of each lineage are truly representative of what was circulating in Mexico at the time, especially since the coordinated sampling and surveillance program across Mexico did not start until May 2021.

      We thank the reviewer for the constructive comments. We recognize the need to better explain how the sequencing efforts in the country were set up and carried out, and this has now been clarified throughout the main text (L43-51, L95-105). A new figure comparing the overall cumulative proportion of genomes generated per state between 2020 and 2021 is now available as Supplementary Figure 1c. The cumulative proportion of genomes sampled across states per lineage of interest, corresponding to the period of circulation of the given lineage, was originally provided as maps in Figures 2-4. This has been further clarified in the Results section and in the corresponding figure legends. We now also provide additional maps representing the geographic distribution of the clades identified per lineage, integrating into the figures the information previously available in Supplementary Data 2 and Supplementary Figures 4 and 5. Note that for our analyses we used the total cumulative genome data available from the country, not only the data generated by CoViGen-Mex (which represent one third of the SARS-CoV-2 genomes from Mexico). This is expected to mitigate any sampling biases related to the scheme adopted by CoViGen-Mex, and is now clearly stated in the main text.

      However, we believe there has been a misunderstanding about the genome sampling scheme adopted by CoViGen-Mex, namely that the ‘coordinated sampling and surveillance program across Mexico did not start until May 2021’. Although further improvements were indeed implemented after this date (making genome sampling and sequencing more homogeneous across the country), overall virus genome sequencing in Mexico was already sufficient from February 2021. This is shown by the correlation between the cumulative number of viral genomes sequenced throughout 2020-2021 (by both CoViGen-Mex and other contributing institutions) and the number of cases officially reported in the country during this time (see Supplementary Figure 1a). This has now been clarified in the Results section (L94-105). Therefore, we hold that “SARS-CoV-2 sequencing in Mexico has been sufficient to explore the spatial and temporal frequency of viral lineages across national territory, and now to further investigate the number of lineage-specific introduction events, and to characterize the extension and geographic distribution of associated transmission chains, as we present in this study” (L102-105). In this context, “a more homogenous sampling across the country is unlikely to impact our main findings, but could i) help pinpoint additional clades we are currently unable to detect, ii) provide further details on the geographic distribution of clades across other regions of the country, and iii) deliver a higher resolution for the viral spread reconstructions we present” (discussed in L466-470).

      For the B.1.1.7 lineage in Mexico, we have clarified the issue raised as follows: “during its circulation period, most B.1.1.7 genomes from Mexico were generated from the state of Chihuahua, with these representing the earliest B.1.1.7-assigned genomes from the country. However, our phylodynamic analysis revealed that only a small proportion of these grouped within a larger clade denoting an extended transmission chain (C2a), with the rest falling within minor clusters, or representing singleton events. Relative to other states, Chihuahua generated an overall lower proportion of viral genomes throughout 2020-2021. Thus, more viral genomes sequenced from a particular state does not necessarily translate into more well-supported clades denoting extended transmission chains, whilst the geographic distribution of clades is somewhat independent to the genome sampling across the country.” (L202-211). Again, these observations are supported by sufficient overall genome sampling from Mexico.

      We would further like to make clear that “our results confirm that the B.1.1.7 lineage reached an overall lower sampling frequency of up to 25% (relative to other virus lineages circulating in the country), as was noted prior to this study (for example, see Zárate et al. 2022)” (L189-193). Similar observations were independently made for other Latin American countries such as Brazil, Chile, and Peru (some with better genome representation than others, for example Brazil; see https://www.gisaid.org/). It is therefore possible that “the overall epidemiological dynamics of the B.1.1.7 in Latin America may have substantially differed from what was observed in the USA and UK. Such differences could be partly explained by competition between co-circulating lineages, exemplified in Mexico by the regional co-circulation of B.1.1.7, P.1 and B.1.1.519. Nonetheless, the lack of a representative number of viral genomes for most of these countries prevents exploring such hypothesis at a larger scale, and further highlights the need to strengthen genomic epidemiology-based surveillance across the region” (now discussed in L372-379). We hope the reviewer considers that the issues raised have now been resolved.

      Reviewer #2 (Public Review):

      The authors use a series of subsampling methods based on phylogenetic placement and geographic setting, informed by human movement data to control for differences in sampling of SARS-CoV-2 genomes across countries. Of note, the authors show that 2 variants likely arose in Mexico and spread via multiple introductions globally, while other variant waves were driven by repeat introductions into Mexico from elsewhere. Finally, they use human mobility data to assess the impact of movement on transmission within Mexico. Overall, the study is well done and provides nice data on an under-studied country. The authors take a thoughtful approach to subsampling and provide a very thorough analysis. Because of the care given to subsampling and the great challenge that proper subsampling represents for the field of phylodynamics, the paper would benefit from a more thorough exploration of how their migration-informed subsampling procedure impacts their results. This would not only help strengthen the findings of the paper, but would likely provide a useful reference for others doing similar studies. Additionally, I would suggest the authors provide a bit more discussion of this subsampling approach and how it may be useful to others in the discussion section of the paper.

      We thank the reviewer for the constructive comments, and appreciate the recognition of our sub-sampling scheme as a valuable tool with potential application in other studies. We acknowledge the need for a ‘more thorough exploration and discussion of how a different migration-informed subsampling approach could impact our results’. To address this issue, “we further sought to validate our migration-informed genome subsampling scheme (applied to B.1.617.2+, representing the best sampled lineage in Mexico). For this, an independent dataset was built using a different migration sub-sampling approach, comprising all countries represented by B.1.617.2+ sequences deposited in GISAID (available up to November 30th 2021). In order to compare the number of introduction events, the new dataset was analysed independently under a time-scaled DTA (as described in Methods Section 4).” (L517-524). In the new dataset, <100 genome sequences from the USA were retained for further analysis (Supplementary Figure 2b), compared to approximately 2000 ‘USA’ genome sequences included in the original B.1.617.2+ alignment. Thus, we expected a lower number of inferred introduction events into Mexico, as an undersampling of viral genome sequences from the USA is likely to result in ‘Mexico’ clades not fully segregating (particularly impacting C5d).

      Our original results revealed a minimum of 142 introduction events into Mexico (95% HPD interval = [125-148]), with 6 clades identified as denoting extended transmission chains. The DTA results derived from the new dataset (subsampling all countries) revealed a minimum of 84 introduction events into Mexico (95% HPD interval = [81-87]), again with 6 major clades identified. Thus, as expected, a significantly lower number of introduction events into Mexico was inferred. On the other hand, the number of clades identified was consistent between both datasets, supporting the robustness of our phylogenetic methodological approach. However, in the new dataset, C5d displayed reduced diversity: it comprised the AY.113 and AY.100 genomes from Mexico, but excluded the B.1.617.2 genome sampled from the USA. This highlights the relevance of our genome subsampling using migration data as a proxy.

      In further agreement with these observations, publicly available data on global human mobility (https://migration-demography-tools.jrc.ec.europa.eu/data-hub/index.html?state=5d6005b30045242cabd750a2) show that migration into Mexico is mostly represented by movements from the USA, followed by Indonesia, Guatemala, Belize and Colombia. However, the volume of movements from the USA into Mexico is much higher (up to 6 orders of magnitude above the volumes recorded into Mexico from any other country).

      Given the time constraints of performing additional analyses, we decided to exclude the ‘top ten countries’ subsampling scheme suggested by the reviewer. However, we consider that the comparison between the original and the new dataset (top-5 vs all countries) is sufficient to support our migration-informed subsampling approach. A full description of the methodology and the results obtained, as well as a short discussion, is now available as Supplementary Text 2 and Supplementary Figures 2b and 2c. We hope the reviewer considers that the issues raised have been addressed.

    1. Author Response

      Reviewer #1 (Public Review):

      High resolution mechanistic studies would be instrumental in driving the development of Cas7-11 based biotechnology applications. This work is unfortunately overshadowed by a recent Cell publication (PMID: 35643083) describing the same Cas7-11 RNA-protein complex. However, given the tremendous interest in these systems, it is my opinion that this independent study will still be well cited, if presented well. The authors obviously have been trying to establish a unique angle for their story, by probing deeper into the mechanism of crRNA processing and target RNA cleavage. The study is carried out rigorously. The current version of the manuscript appears to have been rushed out. It would benefit from clarification and text polishing.

      We thank the reviewer for the positive and helpful comments that have made the manuscript more impactful.

      To summarize the revisions, we have resolved the metal-dependence issue, updated the maps in both main and supplementary figures that support the model, re-organized the labels for clarity, and added a comparison between our structure and that of Kato et al.

      In addition, we describe a new result with an isolated C7L.1 fragment that retains the processing and crRNA binding activities.

      Reviewer #2 (Public Review):

      In this manuscript, Goswami et al. solved a cryo-EM structure of Desulfonema ishimotonii Cas7-11 (DiCas7-11) bound to a guiding CRISPR RNA (crRNA) and target RNA. Cas7-11 is of interest due to its unusual architecture as a single polypeptide, in contrast to other type III CRISPR-Cas effectors that are composed of several different protein subunits. The authors have obtained a high-quality cryo-EM map at 2.82 Å resolution, allowing them to build a structural model for the protein, crRNA and target RNA. The authors used the structure to clearly identify a catalytic histidine residue in the Cas7-11 Cas7.1 domain that is important for crRNA processing activity. The authors also investigated the effects of metal ions and crRNA-target base pairing on target RNA cleavage. Finally, the authors used their structure to guide engineering of a compact version of Cas7-11 in which an insertion domain that is disordered in the cryo-EM map was removed. This compact Cas7-11 appears to have comparable cleavage activity to the full-length protein.

      The cryo-EM map presented in this manuscript is generally of high quality and the manuscript is very well illustrated. However, some of the map interpretation requires clarification (outlined below). This structure will be valuable as there is significant interest in DiCas7-11 for biotechnology. Indeed, the authors have begun to engineer the protein based on observations from the structure. Although characterization of this engineered Cas7-11 is limited in this study and similar engineering was also performed in a recently published paper (PMID 35643083), this proof-of-principle experiment demonstrates the importance of having such structural information.

      The biochemistry experiments presented in the study identify an important residue for crRNA processing, and suggest that target RNA cleavage is not fully metal-ion dependent. Most of these conclusions are based on straightforward structure-function experiments. However, some results related to target RNA cleavage are difficult to interpret as presented. Overall, while the cryo-EM data presented in this work is of high quality, both the structural model and the biochemical results require further clarification as outlined below.

      We thank the reviewer for the positive and helpful comments that have made the manuscript more impactful.

      To summarize the revisions, we have resolved the metal-dependence issue, updated the maps in both main and supplementary figures that support the model, re-organized the labels for clarity, and added a comparison between our structure and that of Kato et al.

      In addition, we describe a new result with an isolated C7L.1 fragment that retains the processing and crRNA binding activities.

      1. The DiCas7-11 structure bound to target RNA was also recently reported by Kato et al. (PMID 35643083). The authors have not cited this work or compared the two structures. While the structures are likely quite similar, it is notable that the structure reported in the current paper is for the wild-type protein and the sample was prepared under reactive conditions, resulting in a partially cleaved target. Kato et al. used a catalytically dead version of Cas7-11 in which the target RNA should remain fully intact. Are there differences in the Cas7-11 structure observed in the presence of a partially cleaved target RNA in comparison to the Kato et al. structure? Such a comparison is appropriate given the similarities between the two reports. A figure comparing the two structures could be included in the manuscript.

      We have added a paragraph on page 12 that describes the differences in how the two complexes were prepared and in their structures. We observed minor differences in the overall protein structure (r.m.s.d. 0.918 Å for 8114 atoms), but quite different interactions between the protein and the first 5’-tag nucleotide (U(-15) vs. G(-15)) due to the different pre-crRNA constructs. This suggests that U(-15) is important for forming the processing-competent active site. We added Figure 2-figure supplement 3 to illustrate the similarities and the differences.
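      The r.m.s.d. quoted above is the root-mean-square of the distances between corresponding atoms once the two models are superposed: RMSD = sqrt((1/N) Σ |x_i − y_i|²), summed over the N matched atoms (N = 8114 here). The minimal Python sketch below assumes the two coordinate lists are already superposed and paired atom by atom. The optimal superposition step that structure-comparison tools perform first (for example Kabsch alignment) is deliberately omitted, and the function name is illustrative.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(a: Sequence[Coord], b: Sequence[Coord]) -> float:
    """Root-mean-square deviation between paired, pre-superposed atoms.

    Assumes a[i] and b[i] are the same atom in the two models and that
    the models are already aligned; no fitting is performed here.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    sq_sum = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(a, b)
    )
    return math.sqrt(sq_sum / len(a))


# Toy example: every atom displaced by 1 Å along x gives an RMSD of 1 Å.
model_1 = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
model_2 = [(1.0, 0.0, 0.0), (2.0, 1.0, 1.0)]
print(rmsd(model_1, model_2))  # 1.0
```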

      2. The cryo-EM density map is of high quality, but some of the structural model is not fully supported by the experimental data (e.g. protein loops from the alphafold model were not removed despite lack of cryo-EM density). Most importantly, there is little density for the target RNA beyond the site 1 cleavage site, suggesting that the RNA was cleaved and the product was released. However, this region of the RNA was included in the structural model. It is unclear what density this region of the target RNA model was based on. Further discussion of the interpretation of the partially cleaved target RNA is necessary. Were 3D classes observed in various states of RNA cleavage and with varied density for the product RNAs?

      We should have made clear in the Methods that multiple maps were used to build the structure; only the post-processed map was submitted to reviewers. In the map generated by RELION 4.0's local resolution estimation, we observed sufficient density for some of the regions the reviewer refers to. For instance, the density at the site 1 cleavage site does support the model for the two nucleotides beyond it (see the revised Figure 1 and Figure 1-figure supplement 3).

      However, some protein loops still lack convincing density. These loops, spanning residues 134-141 and 1316-1329, have now been removed from the final coordinates.

      The phrase “partially cleaved target RNA” reflects the weak density for nucleotides downstream of site 1 (+2 and +3) combined with the clear density flanking site 2. This pattern indicates that cleavage had likely taken place at site 1, but not at site 2, in most of the particles that went into the reconstruction. To further clarify this phrase, we added “The PFS region plus the first base paired nucleotide (+1*) are not observed.” on page 4, and we now indicate more clearly in Figure 1 which nucleotides are or are not built in our model.

      3. The authors argue that site 1 cleavage of target RNA is independent of metal ions. This is a potentially interesting result, but it is difficult to determine whether it is supported by the evidence provided in the manuscript. The Methods section only describes a buffer containing 10 mM MgCl2, but does not describe conditions containing EDTA. How much EDTA was added and was MgCl2 omitted from these samples? In addition, it is unclear whether the site 1 product is visible in Figures 2d and 3d. To my eye, the products that are present in the EDTA conditions on these gels migrated slightly slower than the typical site 1 product. This may suggest an alternate cleavage site or chemistry (e.g. cyclic phosphate is maintained following cleavage). Further experimental details and potentially additional experiments are required to fully support the conclusion that site 1 cleavage may be metal independent.

      As we pointed out in our response to comment #8 from Reviewer 1, this conclusion may have resulted from using an older batch of DiCas7-11 that contains degraded fragments.

      As shown in the attached figure below, “batch Y” was an older prep from our in-house clone and “batch X” is a newer prep from the clone purchased from Addgene (gel on right). The two batches consistently produce metal-independent (batch Y) or metal-dependent (batch X) cleavage (gel on left). It is possible that the degraded fragments in batch Y carry a metal-independent cleavage activity that is absent from the purer batch X.

      We further performed mass spectrometry analysis of two of the degraded fragments from batch Y (indicated by arrows below) and found that these are indeed parts of DiCas7-11. However, without more experimental evidence, we cannot explain why these fragments might have generated metal-independent cleavage at site 1. We therefore replaced all our cleavage results with those obtained from the newer, cleaner prep (batch X) (for instance, Figure 3c). As a result, all references to “metal-independence” were removed.

      With regard to the nature of cleaved products, we found both sites could be inhibited by specific 2’-deoxy modifications, consistent with the previous observation that Type III systems generate a 2’, 3’-cyclic product in spite of the metal dependence (for instance, see Hale, C. R., Zhao, P., Olson, S., Duff, M. O., Graveley, B. R., Wells, L., ... & Terns, M. P. (2009). RNA-guided RNA cleavage by a CRISPR RNA-Cas protein complex. Cell, 139(5), 945-956.)

      We added this rationale based on the new results and believe that these characterizations are now thorough and conclusive.

      4. The authors performed an experiment investigating the importance of crRNA-target base pairing on cleavage activity (Figure 3e). However, negative controls for the RNA targets in the absence of crRNA and Cas7-11 were not included in this experiment, making it impossible to determine which bands on the gel correspond to substrates and which correspond to products. This result is therefore not interpretable by the reader and does not support the conclusions drawn by the authors.

      Our original gel image (below) does contain these controls, but we did not include them in the figure because of space constraints (we should have included the full gel as a supplementary figure). We have now completely updated Figure 3e with much better quality and controls. The original and the updated experiments show the same results.

      Original gel for Figure 3e containing controls.

    1. Author Response:

      Reviewer #2 (Public Review):

      In this work, the authors investigated the versatility of the beta-proteobacterium Cupriavidus necator from the proteome perspective. For this purpose, they cultivated the microorganism in a chemostat using different limiting substrates (fructose, fructose with limited ammonia, formate and succinate) and at different dilution rates. Integrating experimental proteomic data with a resource balance analysis model allowed the authors to understand the relation between enzyme abundances and metabolic fluxes in central metabolism. Moreover, the use of a transposon mutant library and competition experiments added insights regarding the essentiality of the genes studied. This shed light on the (under)utilization of metabolic enzymes, including some interpretations and speculations regarding C. necator's physiological readiness for changes in nutrients within its environmental niche. However, several parts of C. necator metabolism (PHB biosynthesis and photorespiration) are not yet well analyzed, and some conclusions are not well reported.

      Strengths:

      1) The manuscript is well written, easily understandable also for (pure) experimentalists, and adds a novel layer of comprehension to the physiology and metabolism of this biotechnologically relevant microorganism. It is therefore likely to attract attention and be well cited in the metabolic engineering community working on this organism.

      2) More generally, the scope of the study is broad enough to potentially attract experts in the wider-field of autotrophic/mixotrophic metabolism, especially regarding the metabolic difference in the transition from heterotrophic to autotrophic growth modes and vice versa.

      3) Findings from different experimental techniques (chemostat cultivation, proteomics, modelling, mutant libraries) complement each other and increase the level of understanding. The consistency of the results obtained from these different angles makes the study well rounded.

      Weaknesses:

      1) A main conclusion of this paper is that CBB cycle operation in heterotrophic conditions (fructose and succinate) is not useful for biomass growth. However, Shimizu et al., 2015 claim that the CBB cycle is beneficial at least for PHB production, which is increased in the presence of the CBB cycle (as demonstrated by a decrease in PHB production when Rubisco or cbbR is knocked out). In this work the authors do not analyze PHB production, but they do analyze fitness in mutant libraries. They claim not to see this benefit in this study; however, their data (Figure 5F) also show small fitness drops for cbbR mutants on fructose, as well as on succinate. So I think the authors have to revisit this conclusion. The type of modelling they use (RBA/FBA) may not capture such re-assimilation as a 'theoretically efficient' route, as this type of modelling assumes 'stoichiometric' metabolic efficiency with a maximum-growth objective, which does not seem to fully reflect what happens in reality.

      We agree that a minor decrease in fitness is visible for cbbR transposon mutants in heterotrophic conditions (Figure 5F). However, we have noticed that small changes in fitness can occur, particularly at a late stage of cultivation, as an artifact of the sequencing method (fast-growing mutants displacing slow-growing ones). A replicate experiment with pulsed instead of continuous feed showed slightly increased, rather than decreased, fitness on succinate for cbbR (Figure 5-figure supplement 1). We therefore conclude that the resolution of the transposon library experiments is not sufficient to decide whether the cbbR KO mutant conveys a small fitness benefit or loss. As the reviewer correctly points out, Shimizu et al. do not show a general fitness benefit, only increased PHB yield from CO2 refixation. We have rewritten our conclusions to state that our results do not contradict the findings of Shimizu et al., since increased PHB production and slightly decreased fitness (= growth rate) are possible at the same time. We also toned down our conclusions so that the question of a potential small fitness burden or benefit of the CBB cycle in heterotrophic conditions remains open.

      2) The authors focus strongly on readiness as a rationale, but cannot really prove readiness as an explanation for the expression of 'unutilized' proteome. In the manuscript they also mention that it may be a non-optimized, recent evolutionary trait, especially for the Calvin cycle (particularly because of the observed responsiveness of the cbbR regulator to PEP). The authors should discuss this and not present readiness as a proven hypothesis. It would be interesting (and challenging) if the authors could suggest how to investigate, and potentially prove, readiness or 'evolutionary inefficiency'.

      We rephrased the respective sections to highlight readiness as one potential explanation among others. We added a suggestion for an experimental strategy to test this hypothesis (laboratory evolution of lean-proteome strains).

      3) C. necator is well known for producing the storage polymer polyhydroxybutyrate (PHB) under nutrient-limited conditions, such as nitrogen or phosphate starvation. Even though the authors looked at such a nitrogen-limited condition ("ammonia"), they do not report on the enzymes involved in this metabolism (phaABC), which are typically very abundant under these conditions. This should be discussed and ideally also analyzed. The formation of storage polymers is hard to incorporate into flux balance analysis with growth as the objective, yet in real life C. necator can incorporate over 50% of carbon into PHB rather than biomass. I therefore suggest the authors discuss this and ideally develop a framework to analyze it, specifically for the ammonia-limited condition.

      As mentioned above in our response to Reviewer 1, we have now performed nitrogen-limited chemostat cultivations in order to disentangle the formation of biomass and PHB. We have updated our model by incorporating separate fluxes 1) to biomass and 2) to PHB, according to the experimental results. We have also analyzed enzyme abundance and utilization for phaA (model reaction ACACT1r), phaB (AACOAR) and phaC (PHAS). The first two enzymes showed high abundance that increased with the degree of limitation for all substrates. PHAS showed a different pattern, with much lower, constant expression. All enzymes were expressed regardless of N- or C-limitation, but the model only showed their utilization during N-limitation, where PHB production was enforced. These results are summarized in the new Figure 3-figure supplement 2.

      4) The authors extensively discuss the CBB cycle and its proteome abundance. However, autotrophic growth typically also requires photorespiration/phosphoglycolate salvage pathways to deal with the oxygenase side-activity of Rubisco. The authors have not discussed the abundance of the enzymes involved in this key process. A recent PNAS publication on C. necator showed by transcriptomics and knockouts that the glycerate pathway is highly abundant on hydrogen at low CO2 (10.1073/pnas.2012288117). It would be good to include these enzymes and the oxygenase side-activity in the modelling, proteome analysis and fitness analysis. An issue with growth on formate is that the real CO2 concentration in the cells cannot be determined well, but not feeding additional CO2 likely results in substantial oxygenase activity.

      C. necator has several pathways for 2-phosphoglycolate (2-PGly) salvage, as the reviewer points out. The key enzymes for the universal 'upper part' of 2-PGly salvage, 2-PGly-phosphatase (cbbZ2, cbbZP) and glycolate dehydrogenase GDH (GlcDEF), were all quantified in our proteomics experiments. The cbbZ isoenzymes showed the same expression pattern as the other cbb enzymes: highest on formate, lowest on succinate (Figure 1-figure supplement 2D). The GDH subunits encoded by GlcDEF showed no significant trend across growth rates or substrates, and were more than 10-fold less abundant than 2-PGly-phosphatase. This is in line with the findings of Claassens et al., PNAS, 2020, which showed only a 2.5-fold upregulation of GDH transcripts in a low versus high CO2 comparison (changes at the protein level are often less extreme than at the transcript level). The same study demonstrated that the glycerate pathway is the dominant route for 2-PGly salvage and found four enzymes extremely upregulated in low CO2: glyoxylate carboligase GLXCL (H16_A3598), hydroxypyruvate isomerase HPYRI (H16_A3599), tartronate semialdehyde reductase TRSARr (H16_A3600), and glycerate kinase GLYCK (H16_B0612). Here, these enzymes showed only slightly higher abundance on formate than in the other conditions we tested (~2-fold). This increase was much lower than the transcriptional upregulation in Claassens et al. would suggest; it is therefore difficult to say whether 2-PGly salvage plays a role during formatotrophic growth. Moreover, we also investigated conditional essentiality and found that none of the 2-PGly salvage mutants showed impaired growth on formate (see Figure R1 below).

      Unfortunately, there are, to our knowledge, no data available on the rate of Rubisco's oxygenation reaction during formatotrophic growth, and our bioreactor setup does not support measurement of pCO2. It is known, though, that only 25% of the CO2 from formic acid oxidation is consumed for biomass (Grunwald et al., Microb Biotech, 2015, http://dx.doi.org/10.1111/1751-7915.12149), effectively creating an excess intracellular CO2 supply. Further, the substrate specificity of the C. necator Rubisco for CO2 over O2 is very high, about twice that of cyanobacteria (Horken & Tabita, Arch Biochem Biophys, 1999, https://pubmed.ncbi.nlm.nih.gov/9882445/). This indirect evidence suggests that flux through this pathway is most likely marginal. We therefore decided to omit it from model simulations. We have added a paragraph summarizing our findings regarding phosphoglycolate salvage to the Results section.
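
      The specificity argument can be made explicit with the standard textbook relation between the Rubisco specificity factor S_c/o and the ratio of oxygenation to carboxylation rates. It is added here only as an illustration; it assumes that CO2 and O2 compete at a single active site with Michaelis-Menten kinetics, and no values measured in this study are implied:

$$
S_{c/o} = \frac{V_c\,K_o}{V_o\,K_c}, \qquad
\frac{v_o}{v_c} = \frac{1}{S_{c/o}} \cdot \frac{[\mathrm{O_2}]}{[\mathrm{CO_2}]}
$$

      At fixed gas concentrations, doubling S_c/o halves the oxygenation-to-carboxylation ratio, and an intracellular CO2 excess (a larger [CO2]/[O2]) lowers it further. Both effects point toward little 2-PGly being formed.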

      Figure R1: Fitness of 2-phosphoglycolate salvage mutants during growth on three different carbon sources, fructose, formate, and succinate. Four genes essential for growth on formate were included for comparison (soluble formate dehydrogenase fdsABDG). Fitness scores are mean and standard deviation of four biological replicates.

    1. Author Response

      Reviewer #1 (Public Review):

      The idea that the hippocampal code is predictive because it generates responses that match the variable most needed for each task (time or distance) is not fully proved by the analyses provided in the manuscript. For example, in the elapsed-time task there are also place cells, and in the fixed-distance travel there are also cells that encode other features. Rather than a predictive code, this could be a regular sampling of the environment with an overrepresentation of the more salient variable that animals need in order to collect rewards.

      We concur with the Reviewer’s reservation. Claims about predictive coding were removed, and the following possible explanation for the over-representation was suggested instead:

      "These results underscore the flexible coding capabilities of the hippocampus, which are shaped by over-representation of salient variables associated with reward conditions. " (page 1 line 23, page 4 line 27)

      In addition, the analyses provided in the manuscript are rather simple, and better controls could be provided. Improving the analytical quantification of the results is necessary to support the main claim.

      We improved the quantification, as suggested below by specific comments of the reviewer.

      What is the relationship of each type of cell with the speed of the animal?

      The cells were assigned to the different types according to their responses while running across all speeds. However, we checked how the speed of the animal affects the peak firing rate of the cells, for each type of cell. Results of this analysis are presented in Author response image 1. Bars represent the maximum firing rate of all cells of a given type across runs within the specified speed range (mean ± SEM).

      Author response image 1.

      We did not find a significant interaction effect of speed and cell type on the maximum firing rate (two-way ANOVA, p > 0.98).

      What is the relationship with the number of trials the animal has run (first 10 trials, last 10 trials, ...)?

      Some of the animals were subjected to only one type of session. Moreover, they were sometimes trained without recording. Therefore, to answer this question we restricted our analysis to recording sessions in which the animal switched from fixed-time to fixed-distance or vice versa. We compared the first 20 runs with the last 20 runs (data from 10 runs do not provide enough statistical power for this analysis). See the results in Author response table 1.

      Author response table 1.

      To assess the dynamics of the coding flexibility, we defined the Time-Distance index (TDI), which quantifies the balance between the proportion of distance cells and of time cells at a given time, as $\mathrm{TDI} = (N_{\mathrm{Distance}} - N_{\mathrm{Time}})/(N_{\mathrm{Distance}} + N_{\mathrm{Time}})$. The TDI is in the range [0, 1] if the majority of cells are classified as distance cells, and in the range [-1, 0] if the majority of cells are classified as time cells. Chi-square tests for differences in proportions did not reveal significant differences (after correction for multiple comparisons).

      The shaded boxes in Author response table 1 indicate the sessions that followed a transition between session types.
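
      For illustration, the TDI can be computed from per-window cell counts as in the following minimal Python sketch. The function name and the example counts are hypothetical and are not taken from the recorded sessions:

```python
def time_distance_index(n_distance: int, n_time: int) -> float:
    """Time-Distance index: (N_distance - N_time) / (N_distance + N_time).

    Values near +1 mean distance cells dominate, values near -1 mean time
    cells dominate, and 0 means an even split.
    """
    total = n_distance + n_time
    if total == 0:
        raise ValueError("need at least one classified cell")
    return (n_distance - n_time) / total


# Hypothetical counts for the first and last 20 runs of one session.
first_20 = time_distance_index(n_distance=12, n_time=8)
last_20 = time_distance_index(n_distance=6, n_time=14)
print(first_20, last_20)
```

      Normalizing by the total count keeps the index comparable across windows that contain different numbers of classified cells.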

      What is the average firing rate of each neuron?

      This information was now added to the titles of the panels in Figure 2 and Figure 2-figure supplement 1.

      Is there any relationship between intrinsic firing rate and the type of coding that the cell develops in each task?

      Author response image 2 compares the firing rates of time cells and distance cells.

      The distributions are similar (p = 0.975 and p = 0.675 for peak firing rate and mean firing rate, respectively; Kolmogorov-Smirnov (KS) test).

      Author response image 2.

      This figure was added to the supplementary figures (Figure 3-figure supplement 3).
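
      As an illustration of the KS comparison, the two-sample statistic D is the largest vertical gap between the empirical cumulative distributions of the two groups. The sketch below is a minimal pure-Python version with hypothetical firing rates. In practice, a library routine such as scipy.stats.ks_2samp would also return the p-value:

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic D = max |F_a(x) - F_b(x)|."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    # Both empirical CDFs are step functions that only change at sample points,
    # so the supremum of their difference is reached at one of those points.
    for x in a + b:
        f_a = bisect_right(a, x) / len(a)
        f_b = bisect_right(b, x) / len(b)
        d = max(d, abs(f_a - f_b))
    return d


# Hypothetical peak firing rates (Hz) for time cells and distance cells.
time_cells = [4.1, 6.3, 7.8, 9.0, 12.5]
distance_cells = [3.9, 6.0, 8.2, 9.4, 11.7]
print(ks_statistic(time_cells, distance_cells))
```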

      What is the relation of the units of each type with LFP features (theta phase, ripple recruitment)?

      We had LFP recordings for 15 out of 18 sessions. A large proportion of the cells showed phase precession (see Author response table 2). An example is shown in Author response image 3. We could not find a significant relation between phase precession and the cell type or the trial type.

      The table on the left shows the total number of cells analyzed. The table on the right shows the percentage of cells that had a significant linear fit of the theta phase within 80% of the field width, when analyzed per time (top right) or per distance (bottom right). FDist/FTime denote fixed-distance and fixed-time trials, and Dist/Time denote the cell type.

      We did not identify ripple events during treadmill runs.

      Author response table 2

      Author response image 3
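
      The phase-precession criterion can be sketched as a least-squares fit of spike theta phase against normalized position in the field, where position is elapsed time or distance depending on the analysis. This minimal sketch assumes that "within 80% of the field width" means the central 80% of the field. It also ignores the circularity of phase, which a full analysis would handle with circular-linear regression, and it omits significance testing. The function and variable names are hypothetical:

```python
import math


def precession_fit(positions, phases_deg, field_start, field_end, keep_fraction=0.8):
    """Least-squares slope (deg per field width) and Pearson r of phase vs position.

    positions may be elapsed time or distance. Only spikes in the central
    keep_fraction of the field are used (an assumption about the window).
    """
    width = field_end - field_start
    margin = (1.0 - keep_fraction) / 2.0
    xs, ys = [], []
    for pos, phase in zip(positions, phases_deg):
        u = (pos - field_start) / width  # 0 at field entry, 1 at field exit
        if margin <= u <= 1.0 - margin:
            xs.append(u)
            ys.append(phase)
    if len(xs) < 3:
        raise ValueError("too few in-field spikes for a fit")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all spikes at the same position")
    slope = sxy / sxx
    r = sxy / math.sqrt(sxx * syy) if syy > 0 else 0.0
    return slope, r


# Hypothetical spikes whose phase advances (decreases) across a 10 s field.
times = [0.5 * k for k in range(21)]
phases = [350.0 - 30.0 * t for t in times]
print(precession_fit(times, phases, field_start=0.0, field_end=10.0))
```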

      Reviewer #3 (Public Review):

      Weaknesses:

      The original study of Kraus et al. consisted of 3 rats for which all sessions, including both training and recording, were of one type. Another 3 rats had a hybrid mixture of distance and time sessions. This is mentioned very briefly in the main text.

      It would appear that the theory of reward might lead to different predictions that could be verified by comparing these animals session to session at a finer grain. For example, are there examples of cells switching or transforming their “predictive” representations when a large number of trials in one session type is followed by a large number of trials of the opposite type?

      For another example, the transition from training to recording could give similar opportunities. It seems at least possible that ignoring these issues could cause a loss of power.

      We could not compare a particular cell for switching between encodings, since the different types of trials were performed on different days. As an alternative, we compared the populations of cells within the first 20 vs. last 20 trials in recording sessions where the animal switched from fixed-time to fixed-distance or vice versa (see Author response table 1). The “Time-Distance balance index” (TDI) is defined as $\mathrm{TDI} = (N_{\mathrm{Distance}} - N_{\mathrm{Time}})/(N_{\mathrm{Distance}} + N_{\mathrm{Time}})$ and ranges between 0 and 1 if the majority of cells are classified as distance cells, and between -1 and 0 if the majority of cells are classified as time cells.

      In all three animals there seems to be a change between the first 20 runs and the last 20 runs of the same session, following a switch between trial types. However, this change is significant, and in the expected direction, in only one of the animals (BK49, p = 0.02, chi-square test).

      The grayed boxes in Author response table 1 indicate the sessions that followed a transition between session types.
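
      The per-animal comparison of cell-type proportions between the first and last 20 runs can be sketched as a Pearson chi-square test on a 2x2 table (rows: first/last 20 runs; columns: distance/time cells), followed by a correction for testing several animals. This minimal sketch assumes no continuity correction and uses Bonferroni as the correction method; the counts are hypothetical, and the correction actually applied in the analysis may differ:

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2x2 table.

    Returns (statistic, p_value). With 1 degree of freedom, the upper-tail
    probability of the chi-square distribution is erfc(sqrt(chi2 / 2)).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, math.erfc(math.sqrt(chi2 / 2))


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical counts: [distance cells, time cells] in the first and last 20 runs.
animal_tables = [[[12, 8], [5, 15]], [[9, 11], [8, 12]], [[10, 10], [7, 13]]]
raw_p = [chi_square_2x2(t)[1] for t in animal_tables]
print(raw_p, bonferroni(raw_p))
```

      The closed-form 1-degree-of-freedom tail avoids a statistics dependency; a library routine such as scipy.stats.chi2_contingency covers larger tables and the Yates correction.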

      Some circularities in the construction and interpretation of the time-cell and distance-cell classifiers are not clearly addressed. The classifiers currently appear to be fit to predict the type of session a cell’s response patterns are observed within. But it is tautological to use the session type to define the cell type. I sense this is ultimately reasonable because of how the classifier is built, but this concern is not addressed or explained.

      We regret that the term ‘classifiers’ was not sufficiently precise. We used this term to describe the metrics designed to express the relation between firing time and velocity in order to classify cells, rather than classifiers fit to predict the type of session. We believe this to be the source of the apparent circularity. To avoid this confusion, we have now replaced the term “classifier” with the term “metric” throughout.

    1. Author Response

      Reviewer #1 (Public Review):

      In this manuscript, May et al. use H2B overexpression driven by Keratin14-Cre-mediated excision of a loxP-stop cassette to quantify bulk chromatin dynamics in the live epidermis. They observe heterogeneity of H2B distribution within the basal stem cell layer and a change in distribution when the stem cells delaminate into the suprabasal layers. They further show that these chromatin rearrangements precede cell fate commitment, as detected by adding another Cre-mediated transgene on top (tetO-Cre-mediated Keratin10 reporter). Finally, they generate an MS2 stem-loop transgene for the keratin 10 transcript and observe transcriptional bursting.

      We would like to clarify for the reviewer that the H2B system used is a transgenic allele of histone-2B-GFP that is driven directly by the Keratin-14 promoter (Kanda et al., 1998; Tumbar et al., 2004). This system does not rely on any Cre-mediated excision of the LoxP-stop cassette, and these mice do not carry Cre alleles. We will touch on this point below when addressing the comment on Cre expression in cells and the raised question on whether it influences the quantifications of chromatin compaction.

      The manuscript uses elegant in vivo imaging approaches to describe a set of observations that are logically based on a panel of studies that have used genetic approaches to dissect the role of heterochromatin and histone/DNA modifications in epidermal state transitions. In addition, the MS2 stem-loop analysis is a nice technical advance, confirming transcriptional bursting as a general phenomenon of how transcription is regulated in cells (see work from Daniel Larson, Jonathan Chubb, Arjun Raj, and others).

      We thank the reviewer for their recognition of our contribution to the transcription field. To deepen the connection between our data and previous characterizations of transcriptional dynamics in other systems, we have added new analyses of K10MS2 transcriptional bursting on a finer temporal scale (Fig 5G-K). We find pervasive “transcriptional bursting,” consistent with findings in vitro and in other model organisms, and a surprising variation of burst durations. We believe these additional analyses significantly strengthen our conclusions and the relevance of our study to the overall transcription field.
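
      As a rough illustration of how burst durations can be read from an MS2 reporter trace, one simple approach is to threshold the spot intensity over time and measure each contiguous above-threshold run. This is a hypothetical sketch: the threshold, the frame interval and the function name are assumptions and do not describe the authors' actual analysis pipeline:

```python
def detect_bursts(trace, threshold, frame_interval_s):
    """Return (start_frame, duration_s) for each contiguous run above threshold."""
    bursts = []
    start = None
    for i, value in enumerate(trace):
        if value > threshold and start is None:
            start = i  # burst onset
        elif value <= threshold and start is not None:
            bursts.append((start, (i - start) * frame_interval_s))
            start = None
    if start is not None:  # burst still ongoing at the end of imaging
        bursts.append((start, (len(trace) - start) * frame_interval_s))
    return bursts


# Hypothetical background-subtracted spot intensities sampled every 2 s.
intensities = [0.1, 0.2, 1.5, 2.1, 1.8, 0.3, 0.2, 1.2, 0.4, 0.1]
print(detect_bursts(intensities, threshold=1.0, frame_interval_s=2.0))
```

      A single fixed threshold is the simplest choice; smoothing the trace or using hysteresis (separate on and off thresholds) would reduce spurious splitting of bursts by noise.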

      The value of the study in my view is recapitulating these known phenomena in a live tissue setting with high-quality imaging and careful quantification. Overall, the analyses appear thorough, although the overall changes appear relatively minor, which is perhaps to be expected from imaging bulk H2B distribution as a proxy for chromatin states.

      There is one major technical concern that might impact the interpretation of the data. The authors combine Cre lines for their key conclusions (Krt10 reporter and SRF KO) and analyze single cells that thus express very high levels of Cre. Knowing that Cre will target non-loxP sites and is genotoxic, it is possible that the effect of chromatin is due to high levels of Cre expression in single cells rather than specific effects due to cell state transitions. I would encourage the authors to carefully quantify the dose-dependent effects of the Cre protein (independent of the LoxP sites) on chromatin organization. Along these lines, is the phenotype of the SRF KO similar in the presence of two Cre alleles versus just one?

      Thank you for these kind words. This is an important potential caveat to consider. We believe that Cre activity does not significantly affect the chromatin compaction profiles for several reasons. First, we interrogated Cre activity. The quantifications in Figure 1A-E and Figure 2B-C are from mice containing the K14H2B-GFP allele alone, which do not carry any Cre allele. When these data were compared to those from mice that had been treated with a high dose of tamoxifen to induce Cre-mediated recombination in the vast majority of cells, the chromatin compaction profiles were not significantly different (Supp Fig 3C). We have added this comparison to Supplemental Figure 3 and addressed this point in the text (page 9). To further determine whether Cre-mediated recombination affects our measurement of chromatin compaction, we also analyzed adjacent basal cells with and without Cre activity in the same animal. K14H2B-GFP; K14CreER; tdTomato mice were induced with a low dose of tamoxifen such that roughly 65% of epidermal cells underwent Cre recombination, as demonstrated by expression of the tdTomato fluorescent reporter (Gallini et al., 2022). They also received a punch biopsy on the unimaged ear. Three days post injury and six days after Cre induction, the chromatin compaction profiles of cells positive and negative for Cre-mediated recombination were also not significantly different (Rebuttal Figure 1). Together, these direct comparisons between cells exposed and not exposed to Cre activity indicate that Cre activity at levels comparable to those used in our experiments has no measurable effect on our measurements of chromatin compaction.

      Rebuttal Figure 1: Effect of Cre expression on chromatin compaction profiles

      The second issue is the conclusion of "chromatin spinning". Concluding that chromatin is spinning would in my view require that the authors demonstrate that the nuclear envelope is not moving or is moving less than the chromatin. To support this conclusion the authors should do double imaging for example with LINC complex proteins, an ER/outer nuclear membrane marker, or equivalent.

      This is an excellent point. While we expect that the entire nucleus is spinning, based on observations others have made in in vitro fibroblast systems, we describe our observation as “chromatin spinning” instead of “nuclear spinning” because the K14H2B-GFP allele only allows us to directly visualize chromatin itself (Kumar et al., 2014; Zhu et al., 2018).

      Unfortunately, LINC complex proteins and nuclear membrane proteins have not been fluorescently tagged in mice, which prevents us from visualizing their dynamics in vivo. Establishing these new tools and performing the experiments would take more than a year and is therefore beyond the scope of the current paper. Additionally, their relatively uniform distribution across the nuclear membrane would not allow us to visualize potential spinning of these components. We have addressed the reviewer’s question in part by asking whether other compartments within the cell also spin in delaminating cells. To do this, we leveraged a mouse line developed by Claudio Franco’s lab (Barbacena et al., 2019), which fluorescently labels both the chromatin (H2B-GFP) and the Golgi (GTS-mCherry). As expected, this model showed a perinuclear and polarized Golgi in skin fibroblasts (Rebuttal Figure 2). However, this tool is incompatible with our questions in epidermal cells for the following reasons. First, the system is toxic to epithelial cells in vivo, resulting in apoptosis, nuclear fragmentation, and binucleate cells. Second, the Golgi is not discretely polarized (or even perinuclear) in epithelial cells (Rebuttal Figure 2). As such, although we observe chromatin spinning in delaminating basal cells, we are uncertain whether the whole nucleus or any other cellular compartments are spinning in these cells.

      Rebuttal Figure 2: Interrogation of intracellular spinning

      Given the above reasoning and efforts, we have altered the text and specified that we only have the capacity to visualize chromatin through the H2B-GFP allele and that we hypothesize the entire nucleus is spinning (page 11).

      Reviewer #2 (Public Review):

      In this work entitled "Live imaging reveals chromatin compaction transitions and dynamic transcriptional bursting during stem cell differentiation in vivo," the authors use a combination of genetic and imaging tools to characterize dynamic changes in chromatin compaction of cells undergoing epidermal stem cell differentiation and to relate chromatin compaction to transcriptional regulation in vivo. They track this phenomenon by imaging the epithelium of the ear of live mice, thus in a physiological context. By following individual nuclei expressing H2B-GFP over time ranges from hours up to 3 days, they develop a strategy to quantify the profile of chromatin compaction across different epidermal layers based on normalized intensity profiles of H2B-GFP. They observe that cells belonging to the basal stem cell layer display a considerable level of internuclear variability in chromatin compaction that is cell-cycle independent. Instead, intercellular variability in chromatin compaction appears more related to the differentiation status of the cells, as it is stable over hours but dynamic over days. The authors show that differentiated nuclei in the spinous layer exhibit higher chromatin compaction. They also identified a subset of cells in the basal stem cell layer with an intermediate profile of chromatin compaction and with dynamic expression of the early differentiation marker keratin 10. Lastly, they show that the expression of keratin 10 precedes chromatin compaction, establishing relevant temporal relationships in the process of epidermal differentiation.

      This work includes a number of challenging approaches and techniques since it is carried out in living mice. Also, it provides nice tools and methods to study chromatin structure in vivo during multiple days and within a differentiation physiological system. On the other hand, the results are descriptive and, in some respect, expected in line with previous observations.

      Thank you very much for this great summary, kind words, and the recommendations listed below. We will address each of them specifically. We have also deepened the analysis of transcriptional dynamics in ways that are more comparable with how other groups have studied transcription and included those results in Figure 5.

      References

      Kanda, T., Sullivan, K.F., and Wahl, G.M. (1998). Histone–GFP fusion protein enables sensitive analysis of chromosome dynamics in living mammalian cells. Current Biology 8, 377–385. 10.1016/S0960-9822(98)70156-3.

      Tumbar, T., Guasch, G., Greco, V., Blanpain, C., Lowry, W.E., Rendl, M., and Fuchs, E. (2004). Defining the epithelial stem cell niche in skin. Science 303, 359–363. 10.1126/science.1092436.

      Kumar, A., Maitra, A., Sumit, M., Ramaswamy, S., and Shivashankar, G.V. (2014). Actomyosin contractility rotates the cell nucleus. Sci Rep 4, 3781. 10.1038/srep03781.

      Zhu, R., Liu, C., and Gundersen, G.G. (2018). Nuclear positioning in migrating fibroblasts. Seminars in Cell & Developmental Biology 82, 41–50. 10.1016/j.semcdb.2017.11.006.

      Gallini, S., Rahman, N.-T., Annusver, K., Gonzalez, D.G., Yun, S., Matte-Martone, C., Xin, T., Lathrop, E., Suozzi, K.C., Kasper, M., and Greco, V. (2022). Injury suppresses Ras cell competitive advantage through enhanced wild-type cell proliferation. bioRxiv 2022.01.05.475078. 10.1101/2022.01.05.475078.

      Barbacena, P., Ouarné, M., Haigh, J.J., Vasconcelos, F.F., Pezzarossa, A., and Franco, C.A. (2019). GNrep mouse: A reporter mouse for front-rear cell polarity. Genesis. 10.1002/dvg.23299.

      Pineda, C.M., Park, S., Mesa, K.R., Wolfel, M., Gonzalez, D.G., Haberman, A.M., Rompolas, P., and Greco, V. (2015). Intravital imaging of hair follicle regeneration in the mouse. Nature Protocols. 10.1038/nprot.2015.070.

    1. Author Response:

      Reviewer #1 (Public Review):

      This manuscript describes a series of behavioral experiments in which foraging rats are subjected to a novel fear conditioning paradigm. Different groups of animals receive a shock to the dorsal surface of the body paired with either a tone, an artificial owl driven forward with pneumatic pressure, or a tone/owl combination. An additional control condition pairs tone with owl alone (i.e., no shock is delivered). In a subsequent test, only owl+shock and tone/owl+shock animals show increased latency to forage and a withdrawal response to tone (even though owl+shock rats do not experience tone during conditioning). The authors conclude that this tone response is due to sensitization and that fear conditioning does not occur in their experimental setup.

      This approach is intriguing and the issues raised by the manuscript are extremely important for the field to consider. However, there are many ways to interpret the results as they stand. One issue of primary importance is whether it can indeed be claimed that conditioning did not readily occur in the tone+shock group. The lack of a particular behavioral conditioned reaction does not equate to an absence of conditioning. It is possible that unseen (i.e. physiological) measures of conditioning, many of which were once standard DVs in the fear conditioning literature, are present in the tone+shock group. This possibility pushes against the claim made in the title and elsewhere. These claims should be softened.

      We agree with the reviewer and now acknowledge the following caveat in the discussion (pg. 10): “…although neither the tone-shock group nor the tone-owl group showed overt manifestations of fear conditioning (as measured by fleeing or freezing) to the tone that prevented a successful procurement of food, the possibility of physiological (e.g., cardiovascular, respiratory) changes associated with tone-induced fear (Steimer, 2002) cannot be excluded in these animals…”

      Because systemic, group-level retreat CRs are not noted in the tone+shock condition, it would indeed be important to establish whether there are any experimental circumstances in which tone paired with a US applied to the dorsal surface of the body can produce consistent reactions (e.g., freezing) to tone alone. Though it may seem likely that tone + dorsal shock would indeed produce freezing in a different setting, this result should not be taken for granted: we've known since the 'noisy water' experiment (Garcia & Koelling, 1966) that not every CS pairs with every US and that association can indeed be selective. A positive control would be clarifying. If the authors could demonstrate that tone+dorsal shock produces freezing to tone in a commonly used fear conditioning setup (i.e., a standard cubicle chamber), then the lack of a retreat CR in their naturalistic paradigm would gain added meaning.

      This is an excellent suggestion. As recommended, we performed a positive control experiment where naïve rats that underwent the same subcutaneous wire implant surgery were placed in a standard experimental chamber and presented with a delayed tone-shock pairing (same tone frequency/intensity and shock intensity/duration; the 24.1 s CS duration was based on the mean CS duration of tone-shock animals in the naturalistic fear conditioning experiment). As can be seen in Author response image 1 (Figure 4 in the revised manuscript) below, these animals exhibited reliable postshock freezing in a conditioning chamber (fear conditioning day 1) and tone CS-evoked freezing in a novel chamber (tone testing day 2), indicating that our original finding (i.e., no evidence of auditory and contextual fear conditioning in an ecologically-relevant environment) is unlikely due to a dorsal neck/body shock US per se.

      Author response image 1. Auditory fear conditioning in a standard experimental chamber. (A) Illustrations of a rat implanted with wires subcutaneously in the dorsal neck/body region undergoing successive days of habituation (10 min tethered, conditioning chamber), training (a single tone CS-shock US pairing), and tone testing (context shift). (B) Mean (crimson line) and individual (gray lines) percent freezing data from 8 rats (4 females, 4 males) during training in context A: 3 min baseline (BL1, BL2, BL3); 23.1 s epoch of tone (T) excluding 1 s overlap with shock (S); 1 min postshock (PS). (C) Mean and individual percent freezing data during tone testing in context B: 1 min baseline (BL1); 3 min tone (T1, T2, T3); 1 min post-tone (PT). (D) Mean + SEM (bar) and individual (dots) percent freezing to tone CS before (Train, T) and after (Test, T1) undergoing auditory fear conditioning (paired t-test; t(7) = -3.163, p = 0.016). * p < 0.05
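
      The paired comparison in panel D can be illustrated with a minimal paired t statistic: the mean of the within-animal differences divided by its standard error, with n - 1 degrees of freedom. The freezing values below are hypothetical, not the data from the 8 rats. With differences taken as Train minus Test, increased freezing at test gives a negative t, consistent in sign with the reported t(7):

```python
import math
import statistics


def paired_t(before, after):
    """Paired t statistic and degrees of freedom, using differences before - after."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [pre - post for pre, post in zip(before, after)]
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))  # stdev uses n - 1
    return statistics.mean(diffs) / se, len(diffs) - 1


# Hypothetical percent freezing to the tone for 8 rats at training vs test.
train = [0.0, 5.0, 2.0, 0.0, 10.0, 3.0, 0.0, 4.0]
test = [35.0, 60.0, 12.0, 48.0, 70.0, 5.0, 41.0, 55.0]
t_stat, df = paired_t(train, test)
print(f"t({df}) = {t_stat:.3f}")
```

      A library routine such as scipy.stats.ttest_rel computes the same statistic and also returns the two-sided p-value.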

      The altered withdrawal trajectory seen in owl+shock and tone/owl+shock groups occurs in neither the tone+shock nor the tone+owl group, introducing the possibility that it results from the specific pairing of owl and shock. Put differently, this response may indeed be an associative CR. Do altered withdrawal angles persist if animals that receive owl+shock are exposed to owl again the next day? Do manipulations of the owl and shock that diminish fear conditioning (e.g., unpairing of owl and shock stimuli) eliminate deflected withdrawal angles when the subject is exposed to owl alone? If so, it would cut against the interpretation that fear conditioning does not occur in the setup described here, and would instead demonstrate that it is indeed central to predatory defense. This interpretation is compatible with the effect of hippocampal lesions on freezing evoked by a live predator. Destruction of the rat hippocampus diminishes cat-evoked freezing; this is thought to occur because the rapid association of the cat's various features with threatening action is not formed by the rat (Fanselow, 2000, 2018). Even though this interpretation of the results differs from the authors', it in no way diminishes the interest of this work. This paradigm may indeed be a novel means by which to study rapidly acquired associations with ethological relevance. Follow-up experiments of the type described above are necessary to disambiguate opposing views of the current dataset.

      Whether “altered withdrawal angles persist if animals that receive owl+shock [a US-US pairing] are exposed to owl again the next day” is an interesting question, as it is conceivable that the owl US (Zambetti et al., 2019, iScience) can function as a CS to evoke anticipatory responses characteristic of conditioned fear. This possibility is now mentioned as a caveat (pg. 10): “…the erratic escape trajectory behavior exhibited by owl-shock and tone/owl-shock animals may be indicative of rapid associative processes at work (Fanselow 2018). For example, the immediate-shock (and delayed shock-context shift) deficit in freezing (e.g., Fanselow 1986; Landeira-Fernandez et al., 2006) provides compelling evidence that postshock freezing is not a UR but rather a CR to the contextual representation CS that rapidly became associated with the footshock US. In a similar vein then the erratic escape CR topography in owl-shock and tone/owl-shock animals might represent a shift in ‘functional CR topography’ (Fanselow & Wassum 2016) resulting from the rapid association between some salient features of the owl and the dorsal neck/body shock. A rapid owl-shock association nevertheless cannot explain the owl-shock animals’ subsequent fleeing behavior to a novel tone (in the absence of owl), which likely reflects nonassociative fear.”

      Reviewer #2 (Public Review):

      This work deals with an interesting question: whether a simple, one-trial CS+US (Pavlovian) association occurs in a naturalistic environment. Pavlovian fear conditioning consists of repeated presentations of a neutral sensory signal (tone, CS) paired with a mild US, usually foot-shock (<1 mA; thus, unpleasant rather than painful), and the CS+US association drives associative learning. In this paper, a single 2.5 mA electrical shock was paired with a novel 80 dB tone, and the occurrence of learning was monitored by measuring the success rate and latency of foraging for food. Some animals experienced owl looming paired with the US, just before reaching the food. The authors placed hunger-motivated rats into a custom-built arena equipped with a safe nest, a gate, a food zone, and delivery of a self-triggered US (electrical shock to the neck muscle and/or owl looming). The US was activated by the rats approaching the food. Thus, a conflict was created in which procuring the food is paired with an aversive conditioned signal. Four groups of rats were included in the experiments based on their conditioning types: tone+shock, tone+shock+owl, shock+owl and tone+owl. Due to these conditioning procedures, none of the rats procured the food; instead, they fled to the nest. In contrast, in the retrieval phases (next two days), the tone-shock and tone-owl groups successfully procured the pellets during the conditioned tone presentation, but the tone-shock-owl group did not. Rats in the latter group fled to the nest upon tone presentation at the food zone. As the shock-owl animals (conditioned without tone) also fled to the nest when presented with the (unfamiliar) tone, their flight responses and those of the tone+shock+owl group were attributed to a non-associative, sensitization-like process. Furthermore, during the pre-tone trials, all groups behaved as they did in the tone test. These findings led the authors to conclude that classical Pavlovian fear conditioning may not be present in an ecologically relevant environment.

      The question raised is relevant to a broad audience of neuroscientists and behavioral scientists. However, as the fear conditioning paradigm used is not a common one, the findings are difficult to interpret. It is based on a single pairing of an unfamiliar, salient tone with a very strong (traumatizing?) electrical shock, delivered directly into the neck muscle, and an innate signal (owl looming). In addition, as the tone presentation was followed by many events (gate opening, presence of food, shock and/or owl-looming) in front of the animals, it is hard to imagine what sort of tone association could be formed at all.

      We thank the reviewer for mentioning several important considerations. In regards to the shock amplitude used here, fear conditioning studies in rats have employed a wide range of numbers, durations and intensities of footshock; e.g., three footshocks: 1.0 mA/0.75-s and 4.0 mA/3-s (Fanselow 1984), 75 footshocks: 1 mA/2-s (Maren 1999; Zimmerman et al. 2007). Note also that 16-20 periorbital shocks (2.0 mA, 8 pulse train at 5 Hz) have been used in auditory fear conditioning in rats (Moita et al. 2003; Blair et al. 2005). Thus, it is unlikely that a single 2.5 mA dorsal neck/body shock (subcutaneous and not in the neck muscle) used in the present study is particularly traumatizing compared to higher intensity/longer duration (e.g., 4.0 mA/3-s) and far more numerous (e.g., 75) footshocks employed in fear conditioning studies.

      The relationship between footshock intensity and fear conditioning also warrants further discussion. Sigmundi, Bouton, and Bolles (1980) examined conditioned freezing in rats to 15 footshocks of 0.5, 1.0 and 2.0 mA intensities (0.5-s duration) and found that “[tone] CS-evoked freezing increased with US intensity.” In contrast, Fanselow (1984) observed relatively higher contextual freezing in rats subjected to three bouts of 1.0 mA/0.75-s than 4 mA/3-s footshocks. Regardless, the animals that received three 4 mA/3-s footshocks still exhibited robust freezing. Based on the positive control experimental results (see above), it is unlikely that the present study’s failure to observe conditioned fear is due to the use of 2.5 mA shock intensity.

      As the animals in the present study underwent 5 baseline days of foraging (3 trials per day), they would have been habituated to the computer-controlled automated gate opening-closing and the presence of food by the time of tone-shock, tone-owl, owl-shock and tone-owl/shock events, making it unlikely that the tone would associate with the gate/food stimuli. In the employed delay conditioning configuration, the tone CS has greater temporal contiguity with the US (shock and/or owl) and the US is both novel and surprising relative to the other stimuli in the arena environment. Thus, it is more plausible that the tone CS would be associated with the intended US. In summary, we believe that if fear conditioning necessitates relatively sterile environmental settings in order to transpire, then fear conditioning would be implausible in the natural world filled with dynamic, complex stimuli.

      One could also argue that if a hungry animal does not try to collect food after an unpleasant, even painful, experience, it will soon die (thus, that is not a 'natural' behavior). The tone+shock and tone+owl groups showed similar behavioral features throughout the entire experiment, which may reflect natural events: although these rats had had a negative experience, they still approached the food zone because of their hunger. Based on this food-motivated approach, the authors concluded that no association was formed. Is it right to draw this conclusion from this single measure?

      In nature, prey animals adjust their foraging behavior to minimize danger (e.g., Stephens and Krebs 1986 Foraging Theory; Lima and Dill 1990 Can J Zool); thus, it is improbable that an aversive experience will end food-seeking behavior and thereby lead to death. Indeed, Choi and Kim (2010 Proc Natl Acad Sci) employed a seminaturalistic environment similar to that of the present study and found that rats adjust their foraging behavior as a function of the predatory threat distance, consistent with the “predatory imminence” model (Fanselow and Lester 1988). Since only behavioral measures of fear were assessed (i.e., fleeing, latency to enter forage zone, pellet procurement), we now acknowledge a caveat in the discussion (see response to Reviewer 1’s comment 1). Note, however, that unlike the tone-shock paired animals that failed to flee to the tone CS and successfully procured the food pellet, the owl-shock animals exhibited robust fear behavior (promptly fled, ceasing foraging) to a novel tone.

      Reviewer #3 (Public Review):

      In this study, the authors aimed to test whether rats could be fear conditioned by pairing a subdermal electric shock to a tone, an owl-like approaching stimulus, or a combination of these in a naturalistic-like environment. The authors designed a task in which rats foraging for food were exposed to a tone paired to a shock, an owl-like stimulus, a combination of the owl and the shock, or paired the owl to a shock in a single trial. The authors indexed behaviors related to food approach after conditioning. The authors found that animals exposed to the owl-shock or the tone/owl-shock pairing displayed a higher latency to approach the food reward compared to animals that were presented with the tone-shock or the tone-owl pairing. These results suggest that pairing the owl with the shock was sufficient to induce inhibitory avoidance, whereas a single pairing of the tone-shock or the tone-owl was not. The authors concluded that standard fear conditioning does not readily occur in a naturalistic-like environment and that the inhibitory avoidance induced by the owl-shock pairing could be the result of increased sensitization rather than a fear association.

      Strengths:

      The manuscript is well-written, the behavioral assay is innovative, and the results are interesting. The inclusion of both males and females, and the behavioral sex comparison was commendable. The findings are timely and would be highly relevant to the field.

      Weaknesses:

      However, in its current state, this study does not provide convincing evidence to support their main claim that Pavlovian fear conditioning does not readily occur in naturalistic environments. The innovative task presented in this study is more akin to an inhibitory avoidance task rather than fear conditioning and should be reframed in such way.

      The reviewer’s comment is theoretically important in translating laboratory studies of fear to real world situations. Because our animals were engaged in a purposive/goal-oriented foraging behavior, that is, the leaving of nest in search of food in an open space brought about tone-shock, tone-owl, owl-shock and tone/owl shock outcomes, one can make the case that this is in principle an inhibitory avoidance (instrumental fear conditioning) task rather than a Pavlovian fear conditioning task. A pertinent question then is whether procedurally ‘pure’ laboratory Pavlovian conditioning tasks (i.e., displacing animals from their home cage to an experimental chamber and presenting CS and US) are possible in real world settings where behaviors of animals and humans are largely purposive/goal-oriented (Tolman 1948 Psychol Rev). It is generally accepted that “Outside the laboratory, stimulus [Pavlovian] learning and response [Instrumental] learning are almost inseparable (Bouton 2007 Learning and Behavior, pg. 28).” The goal of our study was to investigate whether widely-employed auditory fear conditioning readily produces associative fear memory that guides future behavior in animals performing naturalistic foraging behavior, and insofar as presenting a salient tone CS followed by an aversive shock US, the present study has a Pavlovian fear component.

      We thank the reviewer for raising this concern and have addressed the Pavlovian vs. Instrumental fear conditioning aspects of our study in the revised manuscript (pg. 10): “…there are obvious procedural differences between standard fear conditioning versus naturalistic fear conditioning. In the former paradigm, typically ad libitum fed animals are placed in an experimental chamber for a fixed time before receiving a CS-US pairing (irrespective of their ongoing behavior). Thus, the CS duration and ISI are constant across subjects. In our study, hunger-motivated rats searching for food must navigate to a fixed location in a large arena before experiencing a CS-US pairing (instrumental- or response-contingent). Because animals approach the US trigger zone at different latencies, the CS duration and ISI are variable across subjects.”

      References

      Bernstein, I. L., Vitiello, M. V., & Sigmundi, R. A. (1980). Effects of interference stimuli on the acquisition of learned aversions to foods in the rat. J Comp Physiol Psychol, 94(5), 921-931. doi:10.1037/h0077807

      Blair, H. T., Huynh, V. K., Vaz, V. T., Van, J., Patel, R. R., Hiteshi, A. K., . . . Tarpley, J. W. (2005). Unilateral storage of fear memories by the amygdala. J Neurosci, 25(16), 4198-4205. doi:10.1523/JNEUROSCI.0674-05.2005

      Bouton, M. E. (2007). Learning and Behavior: Sinauer Associates

      Choi, J. S., & Kim, J. J. (2010). Amygdala regulates risk of predation in rats foraging in a dynamic fear environment. Proc Natl Acad Sci U S A, 107(50), 21773-21777. doi:10.1073/pnas.1010079108

      Fanselow, M. S. (1984). Shock-induced analgesia on the formalin test: effects of shock severity, naloxone, hypophysectomy, and associative variables. Behav Neurosci, 98(1), 79-95. doi:10.1037//0735-7044.98.1.79

      Fanselow, M. S. (1986). Associative Vs Topographical Accounts of the Immediate Shock Freezing Deficit in Rats - Implications for the Response Selection-Rules Governing Species-Specific Defensive Reactions. Learning and Motivation, 17(1), 16-39. doi:10.1016/0023-9690(86)90018-4

      Fanselow, M. S. (2018). The Role of Learning in Threat Imminence and Defensive Behaviors. Curr Opin Behav Sci, 24, 44-49. doi:10.1016/j.cobeha.2018.03.003

      Fanselow, M. S., & Lester, L. S. (1988). A functional behavioristic approach to aversively motivated behavior: Predatory imminence as a determinant of the topography of defensive behavior: Lawrence Erlbaum Associates Inc.

      Fanselow, M. S., & Wassum, K. M. (2016). The Origins and Organization of Vertebrate Pavlovian Conditioning. Cold Spring Harbor Perspectives in Biology, 8(1), a021717. doi:10.1101/cshperspect.a021717

      Landeira-Fernandez, J., DeCola, J. P., Kim, J. J., & Fanselow, M. S. (2006). Immediate shock deficit in fear conditioning: effects of shock manipulations. Behav Neurosci, 120(4), 873-879. doi:10.1037/0735-7044.120.4.873

      Lima, S. L., & Dill, L. M. (1990). Behavioral Decisions Made under the Risk of Predation - a Review and Prospectus. Canadian Journal of Zoology, 68(4), 619-640. doi:10.1139/z90-092

      Maren, S. (1999). Neurotoxic basolateral amygdala lesions impair learning and memory but not the performance of conditional fear in rats. J Neurosci, 19(19), 8696-8703.

      Moita, M. A., Rosis, S., Zhou, Y., LeDoux, J. E., & Blair, H. T. (2003). Hippocampal place cells acquire location-specific responses to the conditioned stimulus during auditory fear conditioning. Neuron, 37(3), 485-497. doi:10.1016/s0896-6273(03)00033-3

      Sigmundi, R. A., Bouton, M. E., & Bolles, R. C. (1980). Conditioned Freezing in the Rat as a Function of Shock-Intensity and CS Modality. Bulletin of the Psychonomic Society, 15(4), 254-256.

      Steimer, T. (2002). The biology of fear- and anxiety-related behaviors. Dialogues Clin Neurosci, 4(3), 231-249.

      Stephens, D. W., & Krebs, J. R. (1986). Foraging Theory: Princeton University Press.

      Tolman, E. C. (1948). Cognitive maps in rats and men. Psychol Rev, 55(4), 189-208. doi:10.1037/h0061626

      Zambetti, P. R., Schuessler, B. P., & Kim, J. J. (2019). Sex Differences in Foraging Rats to Naturalistic Aerial Predator Stimuli. iScience, 16, 442-452. doi:10.1016/j.isci.2019.06.011

      Zimmerman, J. M., Rabinak, C. A., McLachlan, I. G., & Maren, S. (2007). The central nucleus of the amygdala is essential for acquiring and expressing conditional fear after overtraining. Learn Mem, 14(9), 634-644. doi:10.1101/lm.607207

    1. Author Response:

      Reviewer #1 (Public Review):

      Overview

      This is a well-conducted study and speaks to an interesting finding on an important topic: whether ethological validity causes co-variation in gamma above and beyond the ethological differences already present in systemic stimulus sensitivity.

      I like the fact that while this finding (seeing red = ethologically valid = more gamma) seems to favor views the PI has argued for, the paper comes to a much simpler and more mechanistic conclusion. In short, it's good science.

      I think they missed a key logical point of analysis, in failing to dive into ERF <----> gamma relationships. In contrast to the modeled assumption that they have succeeded in color matching to create matched LGN output, the ERF and its distinct features are metrics of afferent drive in their own data. And, their data seem to suggest these two variables are not tightly correlated, so at very least it is a topic that needs treatment and clarity as discussed below.

      Further ERF analyses are detailed below.

      Minor concerns

      In general, very well motivated and described; a few terms need more precision ("speedily" and "staircased" are too imprecise given their precise psychophysical goals).

      We have revised the results to clarify:

      "For colored disks, the change was a small decrement in color contrast, for gratings a small decrement in luminance contrast. In both cases, the decrement was continuously QUEST-staircased (Watson and Pelli, 1983) per participant and color/grating to 85% correct detection performance. Subjects then reported the side of the contrast decrement relative to the fixation spot as fast as possible (max. 1 s), using a button press."

      The resulting reaction times are reported slightly later in the results section.
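      QUEST (Watson and Pelli, 1983) is a Bayesian adaptive staircase: it keeps a posterior over the observer's threshold and places each trial where that posterior says the threshold is. A minimal sketch of the idea follows, assuming a Weibull psychometric function on log10 contrast with a 0.5 guess rate (two-alternative side report), a 1% lapse rate and a slope of 3.5. These values, the grid, the Gaussian prior and the class name `GridQuest` are illustrative assumptions, not the study's settings. The offset `eps` shifts the function so that the tracked threshold is the intensity giving the 85% correct target. For simplicity each trial is placed at the posterior mean; QUEST variants differ in whether they use the posterior mode, mean or a quantile.

```python
import math


class GridQuest:
    """Hypothetical grid-based Bayesian threshold tracker in the spirit of QUEST."""

    def __init__(self, guess, prior_sd, target=0.85, gamma=0.5, delta=0.01,
                 beta=3.5, grain=0.01, span=2.0):
        self.gamma, self.delta, self.beta = gamma, delta, beta
        # Choose eps so that p_correct(t, t) == target, i.e. the tracked
        # threshold is the log intensity giving `target` proportion correct.
        core = (target - delta * gamma) / (1.0 - delta)
        u = -math.log((1.0 - core) / (1.0 - gamma))
        self.eps = math.log10(u) / beta
        n = int(round(span / grain))
        # Candidate log10 thresholds centred on the initial guess.
        self.grid = [guess - span / 2 + i * grain for i in range(n + 1)]
        # Gaussian prior, stored as an unnormalised log density.
        self.log_post = [-0.5 * ((t - guess) / prior_sd) ** 2 for t in self.grid]

    def p_correct(self, x, t):
        """Weibull probability of a correct response at log intensity x, threshold t."""
        z = 10.0 ** (self.beta * (x - t + self.eps))
        core = 1.0 - (1.0 - self.gamma) * math.exp(-z)
        return self.delta * self.gamma + (1.0 - self.delta) * core

    def mean(self):
        """Posterior mean of the log10 threshold."""
        m = max(self.log_post)
        w = [math.exp(lp - m) for lp in self.log_post]
        return sum(t * wi for t, wi in zip(self.grid, w)) / sum(w)

    def next_intensity(self):
        """Log intensity for the next trial (posterior mean in this sketch)."""
        return self.mean()

    def update(self, x, correct):
        """Multiply the posterior by the likelihood of the observed response."""
        for i, t in enumerate(self.grid):
            p = self.p_correct(x, t)
            self.log_post[i] += math.log(p if correct else 1.0 - p)
```

      In use, each trial alternates `next_intensity()`, presenting the decrement at that contrast, and `update(x, correct)`; a correct response shifts the estimate toward lower thresholds and an error toward higher ones.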

      I got confused some about the across-group gamma analysis:

      "The induced change spectra were fit per participant and stimulus with the sum of a linear slope and up to two Gaussians." What is the linear slope?

      The slope is used as the null model – we only regarded gamma peaks as significant if they explained spectrum variance beyond any linear offsets in the change spectra. We have clarified in the Results:

      "To test for the existence of gamma peaks, we fit the per-participant, per-stimulus change spectra with three models: a) the sum of two Gaussians and a linear slope, b) the sum of one Gaussian and a linear slope and c) only a linear slope (without any peaks) and chose the best-fitting model using adjusted R² values."
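      The model comparison described above can be sketched as follows. This minimal illustration assumes that the candidate models have already been fitted (for example with a nonlinear least-squares routine) and counts 2 free parameters for the linear slope and 3 per Gaussian (amplitude, centre, width); the helper names and the synthetic spectrum are hypothetical. Ties in adjusted R² are broken in favour of the model with fewer parameters.

```python
import math


def gaussian(f, amp, mu, sigma):
    return amp * math.exp(-0.5 * ((f - mu) / sigma) ** 2)


def linear_fit(freqs, y):
    """Closed-form least-squares line (the no-peak null model); returns fitted values."""
    n = len(freqs)
    mf, my = sum(freqs) / n, sum(y) / n
    sxx = sum((f - mf) ** 2 for f in freqs)
    sxy = sum((f - mf) * (v - my) for f, v in zip(freqs, y))
    slope = sxy / sxx
    return [my + slope * (f - mf) for f in freqs]


def adjusted_r2(y, y_hat, n_params):
    """Adjusted R^2 for a fit with n_params free parameters (intercept included)."""
    n = len(y)
    my = sum(y) / n
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - my) ** 2 for a in y)
    return 1.0 - (ss_res / ss_tot) * (n - 1) / (n - n_params)


def select_model(y, candidates):
    """candidates maps a model name to (fitted values, number of free parameters)."""
    scores = {name: adjusted_r2(y, fit, k) for name, (fit, k) in candidates.items()}
    best = max(scores, key=lambda name: (scores[name], -candidates[name][1]))
    return best, scores


# Synthetic change spectrum: a shallow slope plus one gamma peak at 55 Hz.
freqs = list(range(30, 91, 2))
spectrum = [0.02 * f + gaussian(f, 10.0, 55.0, 5.0) for f in freqs]
# Stand-in for the output of a nonlinear fit of "slope + 1 Gaussian".
one_peak_fit = [0.02 * f + gaussian(f, 10.0, 55.0, 5.0) for f in freqs]
best, scores = select_model(spectrum, {
    "slope only": (linear_fit(freqs, spectrum), 2),
    "slope + 1 Gaussian": (one_peak_fit, 5),
})
```

      A two-Gaussian candidate would be added to the same dictionary with 8 parameters; the adjusted R² penalty keeps it from winning merely because it has more free parameters.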

      To me, a few other analysis approaches would have been intuitive. First, before averaging peak-aligned data, you might consider log-transforming, and computing averages with measures that don't confound peak height and frequency spread (e.g., using the FWHM/peak power as the shape for each, then averaging).

      The reviewer comments on averaging peak-aligned data. This had been done specifically in Fig. 3C. Correspondingly, we understood the reviewer’s suggestion as a modification of that analysis that we now undertook, with the following steps: 1) Log-transform the power-change values; we did this by transforming into dB; 2) Derive FWHM and peak power values per participant, and then average those; we did this by a) fitting Gaussians to the per-participant, per-stimulus power change spectra, b) quantifying FWHM as the Gaussian’s standard deviation, and the peak power as the Gaussian’s amplitude; 3) average those parameters over subjects, and display the resulting Gaussians. The resulting Gaussians are now shown in the new panel A in Figure 3-figure supplement 1.

      (A) Per-participant, the induced gamma power change peak in dB was fitted with a Gaussian added to an offset (for full description, see Methods). Plotted is the resulting Gaussian, with peak power and variance averaged over participants.

      Results seem to be broadly consistent with Fig. 3C.
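      The averaging logic of the new panel can be sketched as below. This minimal sketch assumes per-participant Gaussian fits in dB on a peak-aligned frequency axis (offset 0 Hz at each participant's peak), averages amplitude and standard deviation across participants, and evaluates the resulting curve; the function names and numbers are hypothetical. For a Gaussian, FWHM = 2√(2 ln 2)·σ ≈ 2.355σ, so the standard deviation is a fixed rescaling of the FWHM.

```python
import math

# For a Gaussian, full width at half maximum = 2*sqrt(2*ln 2) * standard deviation.
FWHM_PER_SD = 2.0 * math.sqrt(2.0 * math.log(2.0))


def to_db(ratio):
    """Convert a stimulus/baseline power ratio to decibels."""
    return 10.0 * math.log10(ratio)


def average_gaussian(fits):
    """fits: list of (peak_db, sd_hz) per participant; returns the mean of each."""
    n = len(fits)
    mean_peak = sum(peak for peak, _ in fits) / n
    mean_sd = sum(sd for _, sd in fits) / n
    return mean_peak, mean_sd


def gaussian_curve(peak_db, sd_hz, offsets_hz):
    """Evaluate the averaged Gaussian on a peak-aligned frequency axis."""
    return [peak_db * math.exp(-0.5 * (d / sd_hz) ** 2) for d in offsets_hz]


fits = [(2.0, 4.0), (4.0, 6.0)]  # hypothetical per-participant fits
peak, sd = average_gaussian(fits)
curve = gaussian_curve(peak, sd, range(-20, 21))  # offsets -20..20 Hz
```

      Averaging parameters rather than curves keeps peak height and width separate, which is the confound the reviewer wanted to avoid.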

      Moderate

      I. I would like to see a more precise treatment of ERF and gamma power. The initial slope of the ERF should, by typical convention, correlate strongly with input strength, and the peak should similarly be a predictor of such drive, albeit a weaker one. Figure 4C looks good, but I'm totally confused about what this is showing. If drive = gamma in color space, then these ERF features and gamma power should (by Occam's sledgehammer…) be correlated. I invoke the sledgehammer not the razor because I could easily be wrong, but if you could unpack this relationship convincingly, this would be a far stronger foundation for the 'equalized for drive, gamma doesn't change across colors' argument…(see also IIB below)…

      …and, in my own squinting, there is a difference (~25%) in the evoked dipole amplitudes for the vertically aligned opponent pairs of red and green (along the L-M axis, Fig 2C) on which much hinges in this paper, but no difference in gamma power for these pairs. How is that possible? This logic doesn't support the main prediction that drive-matched differences = matched gamma…Again, I'm happy to be wrong, but I would like to see this analyzed and explained intuitively.

      As suggested by the reviewer, we have delved deeper into ERF analyses. Firstly, we overhauled our ERF analysis to extract per-color ERF shape measures (such as timing and slope), added them as panels A and B in Figure 2-figure supplement 1:

      Figure 2-figure supplement 1. ERF and reaction time results: (A) Average pre-peak slope of the N70 ERF component (extracted from 2-12 ms before per-color, per-participant peak time) for all colors. (B) Average peak time of the N70 ERF component for all colors. […]. For panels A-C, error bars represent 95% CIs over participants, bar orientation represents stimulus orientation in DKL space. The length of the scale bar corresponds to the distance from the edge of the hexagon to the outer ring.
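      The slope measure in panel A can be sketched as follows: find the N70 peak (taken here as the most negative value in a search window) and fit a least-squares line to the samples 2-12 ms before it. This is an illustration under stated assumptions; the 50-110 ms search window, the negative-going sign convention and the use of a regression line (rather than, say, an endpoint difference) are hypothetical choices, not taken from the study's Methods.

```python
def n70_peak_time(times_ms, erf, window=(50.0, 110.0)):
    """Time of the most negative ERF sample inside a (hypothetical) search window."""
    inside = [(v, t) for t, v in zip(times_ms, erf) if window[0] <= t <= window[1]]
    return min(inside)[1]


def pre_peak_slope(times_ms, erf, peak_ms, start_ms=12.0, stop_ms=2.0):
    """Least-squares slope of the ERF from (peak - start_ms) to (peak - stop_ms)."""
    pts = [(t, v) for t, v in zip(times_ms, erf)
           if peak_ms - start_ms <= t <= peak_ms - stop_ms]
    if len(pts) < 2:
        raise ValueError("need at least two samples in the pre-peak window")
    n = len(pts)
    mt = sum(t for t, _ in pts) / n
    mv = sum(v for _, v in pts) / n
    sxx = sum((t - mt) ** 2 for t, _ in pts)
    sxy = sum((t - mt) * (v - mv) for t, v in pts)
    return sxy / sxx


# Toy V-shaped deflection with its minimum at 75 ms, sampled every 1 ms.
times = [float(t) for t in range(0, 201)]
erf = [abs(t - 75.0) - 20.0 for t in times]
peak = n70_peak_time(times, erf)
slope = pre_peak_slope(times, erf, peak)  # signal units per ms
```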

      We have revised the results to report those analyses:

      "The initial ERF slope is sometimes used to estimate feedforward drive. We extracted the per-participant, per-color N70 initial slope and found significant differences over hues (F(4.89, 141.68) = 7.53, pGG < 4×10⁻⁶). Specifically, it was shallower for blue hues compared to all other hues except for green and green-blue (all pHolm < 7×10⁻⁴), while it was not significantly different between all other stimulus hue pairs (all pHolm > 0.07, Figure 2-figure supplement 1A), demonstrating that stimulus drive (as estimated by ERF slope) was approximately equalized over all hues but blue.

      The peak time of the N70 component was significantly later for blue stimuli (Mean = 88.6 ms, CI95% = [84.9 ms, 92.1 ms]) compared to all (all pHolm < 0.02) but yellow, green and green-yellow stimuli, for yellow (Mean = 84.4 ms, CI95% = [81.6 ms, 87.6 ms]) compared to red and red-blue stimuli (all pHolm < 0.03), and fastest for red stimuli (Mean = 77.9 ms, CI95% = [74.5 ms, 81.1 ms]) showing a general pattern of slower N70 peaks for stimuli on the S-(L+M) axis, especially for blue (Figure 2-figure supplement 1B)."
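      The pHolm values quoted above are Holm-Bonferroni adjusted p-values. A minimal, self-contained sketch of the standard step-down adjustment is shown below (the function name is ours). Adjusted values are capped at 1, which is how an adjusted value of exactly 1.0 can arise.

```python
def holm_adjust(pvals):
    """Holm-Bonferroni step-down adjusted p-values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The p-value at 0-based rank k is multiplied by (m - k); the running
        # maximum keeps the adjusted values monotone in the original ordering.
        running_max = max(running_max, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted
```

      For example, raw p-values of 0.01, 0.04 and 0.03 become 0.03, 0.06 and 0.06.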

      We also checked if our main findings (equivalence of drive-controlled red and green stimuli, weaker responses for S+ stimuli) are robust when controlled for differences in ERF parameters and added in the Results:

      "To attempt to control for potential remaining differences in input drive that the DKL normalization missed, we regressed out per-participant, per-color, the N70 slope and amplitude from the induced gamma power. Results remained equivalent along the L-M axis: The induced gamma power change residuals were not statistically different between red and green stimuli (Red: 8.22, CI95% = [-0.42, 16.85], Green: 12.09, CI95% = [5.44, 18.75], t(29) = 1.35, pHolm = 1.0, BF01 = 3.00).

      As we found differences in initial ERF slope especially for blue stimuli, we checked if this was sufficient to explain weaker induced gamma power for blue stimuli. While blue stimuli still showed weaker gamma-power change residuals than yellow stimuli (Blue: -11.23, CI95% = [-16.89, -5.57], Yellow: -6.35, CI95% = [-11.20, -1.50]), this difference did not reach significance when regressing out changes in N70 slope and amplitude (t(29) = 1.65, pHolm = 0.88). This suggests that lower levels of input drive generated by equicontrast blue versus yellow stimuli might explain the weaker gamma oscillations induced by them."
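      One way to read "regressed out per-participant, per-color" is sketched below. Within a participant, induced gamma power for the eight colors is regressed on N70 slope and N70 amplitude (plus an intercept) by ordinary least squares, and the per-color residuals are kept for the group-level comparisons. This is a hypothetical reconstruction for illustration; the exact design used in the study may differ, and the helper names are ours.

```python
def solve(a, rhs):
    """Solve a small dense linear system by Gaussian elimination with partial pivoting."""
    k = len(a)
    m = [row[:] + [v] for row, v in zip(a, rhs)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, k):
            f = m[r][c] / m[c][c]
            for j in range(c, k + 1):
                m[r][j] -= f * m[c][j]
    x = [0.0] * k
    for c in range(k - 1, -1, -1):
        x[c] = (m[c][k] - sum(m[c][j] * x[j] for j in range(c + 1, k))) / m[c][c]
    return x


def ols_residuals(y, predictors):
    """Residuals of y after least-squares regression on predictors plus an intercept."""
    n = len(y)
    x = [[1.0] + [p[i] for p in predictors] for i in range(n)]
    k = len(x[0])
    # Normal equations: (X^T X) b = X^T y.
    xtx = [[sum(x[r][i] * x[r][j] for r in range(n)) for j in range(k)] for i in range(k)]
    xty = [sum(x[r][i] * y[r] for r in range(n)) for i in range(k)]
    b = solve(xtx, xty)
    return [y[r] - sum(x[r][j] * b[j] for j in range(k)) for r in range(n)]
```

      Called once per participant with that participant's eight gamma values and the matching N70 slope and amplitude lists, this returns residuals that are, by construction, uncorrelated with both ERF measures.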

      We added accordingly in the Discussion:

      "The fact that controlling for N70 amplitude and slope strongly diminished the recorded differences in induced gamma power between S+ and S- stimuli supports the idea that the recorded differences in induced gamma power over the S-(L+M) axis might be due to pure S+ stimuli generating weaker input drive to V1 compared to DKL-equicontrast S- stimuli, even when cone contrasts are equalized."

      Additionally, we made the correlation between ERF amplitude and induced gamma power clearer to read by correlating them directly. Accordingly, the relevant paragraph in the results now reads:

      "In addition, there were significant correlations between the N70 ERF component and induced gamma power: The extracted N70 amplitude was correlated across colors with the induced gamma power change within participants with on average r = -0.38 (CI95% = [-0.49, -0.28], pWilcoxon < 4×10⁻⁶). This correlation was specific to the gamma band and the N70 component: Across colors, there were significant correlation clusters between V1 dipole moment 68-79 ms post-stimulus onset and induced power between 28-54 Hz and 72 Hz (Figure 4C, rmax = 0.30, pTmax < 0.05, corrected for multiple comparisons across time and frequency)."
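      The "within participants ... on average r" statistic can be sketched as follows: for each participant, correlate N70 amplitude with induced gamma-power change across the colors, then average the per-participant coefficients. Names and toy values are hypothetical. The group-level Wilcoxon signed-rank test of the per-participant r values is not implemented here (`scipy.stats.wilcoxon` is one existing implementation), and since the text does not say whether r was Fisher-z-transformed before averaging, the plain mean is shown.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_within_subject_r(n70_by_subject, gamma_by_subject):
    """Correlate across colors within each participant, then average the r values."""
    rs = [pearson_r(x, y) for x, y in zip(n70_by_subject, gamma_by_subject)]
    return sum(rs) / len(rs), rs
```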

      II. As indicated above, the paper rests on accurate modeling of human LGN recruitment, based in fact on human cone recruitment. However, the exact details of how such matching was obtained were discussed only briefly. This technical detail is much more than just a detail in a study on color matching: I am not against the logic, nor do I know of a flaw, but it is the hinge of the paper and is dealt with only glancingly.

      A. Some discussion of model limitations

      B. Why it's valid to assume LGN matching has been achieved using data from the periphery: To my knowledge, nobody has ever recorded single units in human LGN with these color stimuli…in contrast, the ERF is 'in their hands' and could be directly related (or not) to gamma and to the color matching predictions of their model.

      We have revised the respective paragraph of the introduction to read:

      "Earlier work has established in the non-human primate that LGN responses to color stimuli can be well explained by measuring retinal cone absorption spectra and constructing the following cone-contrast axes: L+M (capturing luminance), L-M (capturing redness vs. greenness), and S-(L+M) (capturing S-cone activation, which corresponds to violet vs. yellow hues). These axes span a color space referred to as DKL space (Derrington, Krauskopf, and Lennie, 1984). This insight can be translated to humans (for recent examples, see Olkkonen et al., 2008; Witzel and Gegenfurtner, 2018), if one assumes that human LGN responses have a similar dependence on human cone responses. Recordings of human LGN single units to colored stimuli are not available (to our knowledge). Yet, sensitivity spectra of human retinal cones have been determined by a number of approaches, including ex-vivo retinal unit recordings (Schnapf et al., 1987), and psychophysical color matching (Stockman and Sharpe, 2000). These human cone sensitivity spectra, together with the mentioned assumption, allow one to determine a DKL space for human observers. To show color stimuli in coordinates that model LGN activation (and thereby V1 input), monitor light emission spectra for colored stimuli can be measured to define the strength of S-, M-, and L-cone excitation they induce. Then, stimuli and stimulus background can be picked from an equiluminance plane in DKL space."
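      The construction described in this paragraph can be illustrated with a deliberately simplified sketch. Cone excitations are computed as the wavelength-wise product of a measured emission spectrum with cone sensitivity functions, converted to contrasts relative to the background, and combined into three opponent coordinates. The equal weighting of L and M, the absence of the per-axis scaling used in real DKL implementations, and all function names and toy numbers are assumptions for illustration only; in practice luminance is defined via Vλ rather than by an unweighted L+M contrast sum.

```python
def cone_excitation(spectrum, sensitivity, step_nm=1.0):
    """Sum of emission x cone sensitivity over wavelength samples (same grid assumed)."""
    return sum(e * s for e, s in zip(spectrum, sensitivity)) * step_nm


def cone_contrasts(stim_lms, bg_lms):
    """Weber contrast of each cone class relative to the background."""
    return tuple((s - b) / b for s, b in zip(stim_lms, bg_lms))


def opponent_coords(stim_lms, bg_lms):
    """Unscaled, simplified stand-ins for the L+M, L-M and S-(L+M) axes."""
    cl, cm, cs = cone_contrasts(stim_lms, bg_lms)
    lum = (cl + cm) / 2.0
    l_minus_m = (cl - cm) / 2.0
    s_axis = cs - lum
    return lum, l_minus_m, s_axis


def is_equiluminant(stim_lms, bg_lms, tol=1e-3):
    """True if the simplified luminance coordinate is (near) zero."""
    return abs(opponent_coords(stim_lms, bg_lms)[0]) < tol
```

      With a background of (L, M, S) = (10, 5, 2), a "reddish" stimulus (11, 4.5, 2) has +10% L and -10% M contrast, so it lies on the simplified equiluminant plane with a positive L-M coordinate and no S modulation.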

      Reviewer #2 (Public Review):

      The major strengths of this study are the use of MEG measurements to obtain spatially resolved estimates of gamma rhythms from a large(ish) sample of human participants, during presentation of stimuli that are generally well matched for cone contrast. Responses were obtained using a 10deg diameter uniform field presented in and around the centre of gaze. The authors find that stimuli with equivalent cone contrast in the L-M axis generated equivalent gamma, i.e., that 'red' (+L-M) stimuli do not generate stronger responses than 'green' (-L+M). The MEG measurements are carefully made and participants performed a decrement-detection task away from the centre of gaze (but within the stimulus), allowing measurements of perceptual performance and in addition controlling attention.

      There are a number of additional observations that make clear that the color and contrast of stimuli are important in understanding gamma. Psychophysical performance was worst for stimuli modulated along the +S-(L+M) direction, and these directions also evoked weakest evoked potentials and induced gamma. There also appear to be additional physiological asymmetries along non-cardinal color directions (e.g. Fig 2C, Fig 3E). The asymmetries between non-cardinal stimuli may parallel those seen in other physiological and perceptual studies and could be drawn out (e.g. Danilova and Mollon, Journal of Vision 2010; Goddard et al., Journal of Vision 2010; Lafer-Sousa et al., JOSA 2012).

      We thank the reviewer for the pointers to relevant literature and have added in the Discussion:

      "Concerning off-axis colors (red-blue, green-blue, green-yellow and red-yellow), we found stronger gamma power and ERF N70 responses to stimuli along the green-yellow/red-blue axis (which has been called lime-magenta in previous studies) compared to stimuli along the red-yellow/green-blue axis (orange-cyan). In human studies varying color contrast along these axes, lime-magenta has also been found to induce stronger fMRI responses (Goddard et al., 2010; but see Lafer-Sousa et al., 2012), and psychophysical work has proposed a cortical color channel along this axis (Danilova and Mollon, 2010; but see Witzel and Gegenfurtner, 2013)."

      Similarly, the asymmetry between +S and -S modulation is striking and needs a better explanation within the model (that thalamic input strength predicts gamma strength), given that +S inputs to cortex appear to be, if anything, stronger than -S inputs (e.g. DeValois et al. PNAS 2000).

      We followed the reviewer’s suggestion and modified the Discussion to read:

      "Contrary to the unified pathway for L-M activation, stimuli high and low on the S-(L+M) axis (S+ and S-) each target different cell populations in the LGN, and different cortical layers within V1 (Chatterjee and Callaway, 2003; De Valois et al., 2000), whereby the S+ pathway shows higher LGN neuron and V1 afferent input numbers (Chatterjee and Callaway, 2003). Other metrics of V1 activation, such as ERPs/ERFs, reveal that these more numerous S+ inputs result in a weaker evoked potential that also shows a longer latency (our data; Nunez et al., 2021). The origin of this dissociation might lie in different input timing or less cortical amplification, but remains unclear so far. Interestingly, our results suggest that cortical gamma is more closely related to the processes reflected in the ERP/ERF: Stimuli inducing stronger ERF induced stronger gamma; and controlling for ERF-based measures of input drives abolished differences between S+ and S- stimuli in our data."

      Given that this asymmetry presents a potential exception to the direct association between LGN drive and V1 gamma power, we have toned down claims of a direct input drive to gamma power relationship in the Title and text and have refocused instead on L-M contrast.

      My only real concern is that the authors use a precomputed DKL color space for all observers. The problem with this approach is that the isoluminant plane of DKL color space is predicated on a particular balance of L- and M-cones to Vlambda, and individuals can show substantial variability of the angle of the isoluminant plane in DKL space (e.g. He, Cruz and Eskew, Journal of Vision 2020). There is a non-negligible chance that all the responses to colored stimuli may therefore be predicted by projection of the stimuli onto each individual's idiosyncratic Vlambda (that is, the residual luminance contrast in the stimulus). While this would be laborious to assess in the MEG measurements, it may be possible to assess perceptually as in the He paper above or by similar methods. Regardless, the authors should consider the implications; this is important because, for example, it may suggest the importance of signals from the magnocellular pathway, which are thought to be important for Vlambda.

      We followed the suggestion of the reviewer, performed additional analyses and report the new results in the following Results text:

      "When perceptual (instead of neuronal) definitions of equiluminance are used, there is substantial between-subject variability in the ratio of relative L- and M-cone contributions to perceived luminance, with a mean ratio of L/M luminance contributions of 1.5-2.3 (He et al., 2020). Our perceptual results are consistent with that: we had determined the color-contrast change-detection threshold per color; we used the inverse of this threshold as a metric of color change-detection performance; the ratio of this performance metric between red and green (L divided by M) had an average value of 1.48, with substantial variability over subjects (CI95% = [1.33, 1.66]).

      If such variability also affected the neuronal ERF and gamma power measures reported here, L/M-ratios in color-contrast change-detection thresholds should be correlated across subjects with L/M-ratios in ERF amplitude and induced gamma power. This was not the case: Change-detection threshold red/green ratios were neither correlated with ERF N70 amplitude red/green ratios (ρ = 0.09, p = 0.65), nor with induced gamma power red/green ratios (ρ = -0.17, p = 0.38)."
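      The two steps in this analysis (a per-subject red/green ratio of inverse change-detection thresholds, then a Spearman rank correlation of such ratios across subjects) can be sketched in plain Python. Function names and inputs are hypothetical; tied values receive average ranks, and no p-value is computed here (a library routine such as `scipy.stats.spearmanr` returns one).

```python
import math


def performance_ratio(threshold_red, threshold_green):
    """Red/green ratio of change-detection performance, with performance = 1/threshold."""
    return (1.0 / threshold_red) / (1.0 / threshold_green)


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of average ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))
```

      Applied across subjects, `spearman_rho(threshold_ratios, n70_ratios)` and `spearman_rho(threshold_ratios, gamma_ratios)` correspond to the two correlations reported in the quoted text.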

      Reviewer #3 (Public Review):

      This is an interesting article studying human color perception using MEG. The specific aim was to study differences in color perception related to different S-, M-, and L-cone excitation levels and especially whether red color is perceived differently from other colors. To my knowledge, this is the first study of its kind and as such very interesting. The methods are excellent and the manuscript is well written, as expected of a manuscript from this lab. However, the illustrations of the results are not optimal and could be enhanced.

      Major

      The results presented in the manuscript are very interesting, but not presented comprehensively enough to evaluate their validity. The main results of the manuscript are that the gamma-band responses to stimuli with absolute L-M contrast, i.e. green and red stimuli, do not differ, but they differ for stimuli on the S-(L+M) (blue vs red-green) axis, and gamma-band responses for blue stimuli are smaller. These data are presented in figure 3, but in its current form, these results are not well conveyed by the figure. The main results are illustrated in figures 3BC, which show the average waveforms for grating and for different color stimuli. While there are confidence limits for the gamma-band responses for the grating stimuli, there are no confidence limits for the responses to different color stimuli. Therefore, the main results on the similarities / differences between the responses to different colors can't be evaluated based on the figure, and hence confidence limits should be added to these data.

      Figure 3E reports the gamma-power change values after alignment to the individual peak gamma frequencies, i.e. the values used for statistics, and does report confidence intervals. Nevertheless, we take the reviewer's point that confidence intervals are also helpful in the non-aligned, complete spectra. We found that including confidence intervals in Figure 3B,C, with their many overlapping spectra, renders those panels unreadable. Therefore, we added the new panel Figure 3-figure supplement 2A, showing each color’s spectrum separately:

      (A) Per-color average induced power change spectra. Banding shows 95% confidence intervals over participants. Note that the y-axis varies between colors.

      It is also not clear from the figure legend which time window the waveforms are averaged over.

      We have added in the legend:

      "All panels show power change 0.3 s to 1.3 s after stimulus onset, relative to baseline."

      The time-resolved profile of gamma-power changes is illustrated in Fig. 3D. This figure would be a perfect place to illustrate the main results. However, of all the color stimuli, these TFRs are shown only for the green stimulus, not for the red-green differences nor for the blue stimuli, for which responses were smaller. Why are these TFRs not shown for all color stimuli and for their differences?

      Figure 3-figure supplement 3. Per-color time-frequency responses: Average stimulus-induced power change in V1 as a function of time and frequency, plotted for each color.

      We agree with the reviewer that TFR plots can be very informative. We followed their request and included TFRs for each color as Figure 3-Figure supplement 3.

      Regarding the suggestion to also include TFRs for the differences between colors, we note that this would amount to 28 TFRs, one for each pair of colors. Furthermore, while gamma peaks were often clear, their peak frequencies varied substantially across subjects and colors. Therefore, we based our statistical analysis on the power at the peak frequencies, corresponding to peak-aligned spectra (Fig. 3C). A comparison of Figure 3C with Figure 3B shows that the shape of the non-aligned average spectra is strongly affected by inter-subject variability in peak frequency and is therefore hard to interpret. For this reason, we refrained from showing TFRs for differences between colors, which would also lack the required peak alignment.
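Why peak alignment matters can be shown with a small sketch. It assumes every participant's spectrum is sampled on the same evenly spaced frequency grid; the 30 to 90 Hz default search band and the function names are illustrative assumptions, not the parameters used in the study. Each spectrum is re-centered on its own gamma peak before averaging, so peaks at different frequencies are not smeared out as they are in a plain, non-aligned average.

```python
def peak_index(freqs, power, band):
    """Index of the largest power value whose frequency lies inside `band`."""
    candidates = [i for i, f in enumerate(freqs) if band[0] <= f <= band[1]]
    if not candidates:
        raise ValueError("no frequency bins inside the band")
    return max(candidates, key=lambda i: power[i])


def peak_aligned_average(freqs, spectra, band=(30.0, 90.0), half_width=2):
    """Average spectra after re-centering each one on its own gamma peak.

    Assumes all spectra share the evenly spaced frequency grid `freqs`
    (the band is an illustrative assumption). Returns (offsets in bins,
    mean power at each offset).
    """
    offsets = list(range(-half_width, half_width + 1))
    sums = [0.0] * len(offsets)
    for power in spectra:
        p = peak_index(freqs, power, band)
        if p - half_width < 0 or p + half_width >= len(power):
            raise ValueError("alignment window runs past the spectrum edge")
        for k, off in enumerate(offsets):
            sums[k] += power[p + off]
    return offsets, [s / len(spectra) for s in sums]
```

With two hypothetical participants peaking at 50 Hz (power 5) and 60 Hz (power 3), the aligned average keeps a peak of 4, whereas averaging bin by bin would give only 3 at 50 Hz.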

    1. Author Response:

      Reviewer #1:

      Insulin-secreting beta-cells are electrically excitable, and action potential firing in these cells leads to an increase in the cytoplasmic calcium concentration that in turn stimulates insulin release. Beta-cells are electrically coupled to their neighbours, and electrical activity and calcium waves are synchronised across the pancreatic islets. How these oscillations are initiated is not known. In this study, the authors identify a subset of 'first responder' beta-cells that are the first to respond to glucose and that initiate a propagating Ca2+ wave across the islet. These cells may be particularly responsive because of their intrinsic electrophysiological properties. Somewhat unexpectedly, the electrical coupling of first responder cells appears weaker than that of the other islet cells, but this paradox is well explained by the authors. Finally, the authors provide evidence of a hierarchy of beta-cells within the islets, and show that if the first responder cells are destroyed, other islet cells are ready to take over.

      The strengths of the paper are the advanced calcium imaging, the photoablation experiments and the longitudinal measurements (up to 48h).

      Whilst I find the evidence for the existence of first responders and a hierarchy convincing, the link between first responders in isolated individual islets and the first-phase insulin secretion seen in vivo (which becomes impaired in type-2 diabetes) seems somewhat overstated. It is difficult to see how first responders in one islet could synchronise secretion from thousands (rodents) to millions (humans) of islets, and it might be wise to tone down this particular aspect.

      We thank the reviewer for highlighting this point. We acknowledge that we did not measure insulin release from individual islets after first responder cell ablation, where we observed a diminished first-phase Ca2+ response. We do note that studies have linked the first-phase Ca2+ response to first-phase insulin release [Henquin et al., Diabetes (2006); Head et al., Diabetes (2012)], albeit with additional amplifying signals at higher glucose elevations. Thus, a diminished first-phase Ca2+ response would imply diminished first-phase insulin release (although, given the amplifying signals, the converse would not necessarily hold).

      Nevertheless, there are also important caveats to our experiment. Within each islet we ablated a single first responder cell. In small islets this ablation diminished Ca2+ in the imaged plane; in larger islets it did not, pointing to the presence of multiple first responder cells. Furthermore, we only observed the plane of the islet containing the ablated first responder, so it is possible that [Ca2+] elsewhere in the islet was not significantly disrupted. Thus, even within a small islet, redundancy is possible: multiple first responder cells may be present and together drive first-phase [Ca2+] across the islet, so that loss of a single first responder cell only disrupts Ca2+ locally. The relationship we observe between the timing of the [Ca2+] response and the distance from the first responder supports this notion. Results from the islet model also support it: more than 10% of cells had to be ablated to significantly disrupt first-phase Ca2+.

      While we already discuss the issue of redundancy in large islets and in 3D, we now briefly mention the importance of measuring insulin release.

      Reviewer #2:

      Kravets et al. further explored the functional heterogeneity in insulin-secreting beta cells in isolated mouse islets. They used slow cytosolic calcium [Ca2+] oscillations with a cycle period of 2 to several minutes in both phases of glucose-dependent beta cell activity that got triggered by a switch from unphysiologically low (2 mM) to unphysiologically high (11 mM) glucose concentration. Based on the presented evidence, they described a distinct population of beta cells responsible for driving the first phase [Ca2+] elevation and characterised it to be different from some other previously described functional subpopulations.

      Strengths:

      The study uses advanced experimental approaches to address a specific role a subpopulation of beta cells plays during the first phase of an islet response to 11 mM glucose or strong secretagogues like glibenclamide. It finds elements of a broadscale complex network on the events of the slow time scale [Ca2+] oscillations. For this, they appropriately discuss the presence of most connected cells (network hubs) also in slower [Ca2+] oscillations.

      Weakness:

      The critical weakness of the paper is the evaluation of linear regressions that should support the impact of relative proximity (Fig. 1E), of the response consistency (Fig. 2C), and of increased excitability of the first responder cells (Fig. 3B). None of the datasets provided in the submission satisfies the criterion of normality of the distribution of regression residuals. In addition, the interpretation that the majority of first responder cells retain their early response time could as well be interpreted that the majority does not.

      We thank the reviewers for their input, as it really opened multiple opportunities for us to improve our analysis and strengthen our arguments of the existence and consistency of the first responder cells. We present more detailed analysis for these respective figures below and describe how these are included in the manuscript.

      As described below, we performed additional in-depth analysis and statistical evaluation of the data presented in Figures 1E, 2C, and 3B. We now report that two of the datasets (Fig. 1E, Fig. 2C) satisfy the criterion of normality of the distribution of regression residuals. The third dataset (Fig. 3B) does not satisfy this criterion, and we have updated our interpretation of these data in the text.

      Figure 1E Statistics, Scatter: We now show the slope, the p-value for the deviation of the slope from 0, and the r^2 value in Fig. 1E. While the scatter is large (r^2 = 0.1549 in Fig. 1E) when cells at all distances from the first responder cell are included, it diminishes substantially when we consider only cells located closer to the first responder (r^2 = 0.3219 in Fig. S1F): the response time for cells at distances up to 60 μm from the first responder cells is now shown in Fig. S1F. We chose 60 μm because it is the maximum first-to-last responder distance in our data set (see red box in Fig. 1D).
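The r^2 comparison between all cells and cells within 60 μm of the first responder amounts to refitting an ordinary least-squares line on a distance-filtered subset. The following is a minimal sketch, assuming distances in μm and response times are parallel lists (the function names are hypothetical). It does not test residual normality; that could be checked separately, for example by passing the residuals to scipy.stats.shapiro.

```python
def linear_fit(x, y):
    """Ordinary least squares y = slope * x + intercept; returns (slope, intercept, r2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


def fit_within_radius(distances_um, response_times, max_distance_um=60.0):
    """Fit response time against distance, keeping only cells within max_distance_um."""
    pairs = [(d, t) for d, t in zip(distances_um, response_times) if d <= max_distance_um]
    xs, ys = zip(*pairs)
    return linear_fit(list(xs), list(ys))
```

A single distant cell that responds early (for example, one sitting near a second first responder) lowers r^2 for the full set but is excluded by the radius filter.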

      Additionally, we noticed that larger islets may contain multiple domains, each with its own first responder in the center (now in Fig. S1E) and below. The linear distance/time dependence is preserved within each domain.

      Figure 1E Normality of residuals: We appreciate the reviewer’s suggestion and now see that the original “distance vs time” dependence in Fig. 1E did not pass the test for normality of residuals. When plotted as distance (μm) vs response time (percentile), the cumulative distribution still did not pass the Shapiro-Wilk test for normality of residuals (see QQ plot “All distances” below). However, for cells located within 60 μm of the first responder, the residuals pass the Shapiro-Wilk normality test. The QQ plots for “up to 60 μm distances” are included in Fig. S1G.

      Figure 2C Statistic and Scatter: After consulting a biostatistician (Dr. Laura Pyle), we realized that since the response times during initial vs repeated glucose elevation were measured in the same islet, these were repeated measurements on the same statistical units (i.e. a longitudinal study). This required a mixed-model analysis rather than the simple linear regression we used initially. We have now applied a linear mixed effects model (LMEM) to LN-transformed data (original data + 0.0001). The 0.0001 value was added to avoid taking LN(0).
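The LMEM itself is not reproduced here. As a simpler illustration of why repeated measurements on the same islet call for more than a pooled regression, the sketch below applies the LN(x + 0.0001) transform described above. It then estimates a within-islet slope using a fixed-effects, islet-demeaned estimator, which removes each islet's overall offset. This is a deliberately simpler, swapped-in estimator, not the authors' random-effects model. A random-intercept LMEM of the kind described could instead be fit with statsmodels' `mixedlm`. All names are hypothetical.

```python
import math

OFFSET = 0.0001  # value added before the natural log, as described in the text


def ln_transform(values, offset=OFFSET):
    """Natural log of (value + offset), so that zeros stay finite."""
    return [math.log(v + offset) for v in values]


def within_islet_slope(islet_ids, initial, repeated):
    """Slope of repeated vs initial response time after removing each islet's mean.

    Demeaning per islet discards between-islet offsets, so the slope reflects
    only how cells compare within the same islet across the two stimulations.
    """
    by_islet = {}
    for islet, x, y in zip(islet_ids, initial, repeated):
        by_islet.setdefault(islet, []).append((x, y))
    sxx = sxy = 0.0
    for pairs in by_islet.values():
        mx = sum(x for x, _ in pairs) / len(pairs)
        my = sum(y for _, y in pairs) / len(pairs)
        for x, y in pairs:
            sxx += (x - mx) ** 2
            sxy += (x - mx) * (y - my)
    return sxy / sxx
```

Applied as `within_islet_slope(ids, ln_transform(initial), ln_transform(repeated))`, two islets that share the same within-islet relationship but differ in baseline would yield their common slope, which a pooled fit can distort.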

      We now show the LMEM-derived slope and the p-value for the deviation of the slope from 0 in Fig. 2C. Further, we sorted the data presented in Fig. 2C by distance to each of the first responders (now added to Fig. 2D). An example of sorted vs non-sorted response times in a large islet with multiple first responders is added to the Source Data – Figure 1. We found a substantial reduction of the scatter in the distance-sorted data compared to the non-sorted data, which indicates that the consistency of a cell's glucose response correlates with its proximity to the first responder. We also discuss this in the first subsection of the Discussion.

      Figure 2C Normality of residuals: The residuals of the LMEM of the LN-transformed data pass the Shapiro-Wilk normality test. To perform the natural-log transformation, we added a very small number (0.0001) to all 0 values in the data set presented in Fig. 2C, D, and Fig. S4A. Details on the LMEM and its output are added to the Source data – Statistical analysis file.

      Figure 3B Statistic and Scatter: We now show the LMEM-derived slope and the p-value for the deviation of the slope from 0 in Fig. 3B (below). The LMEM-derived slope has a p-value of 0.1925, indicating that the slope is not significantly different from 0. This result changes our original interpretation, and we have edited the associated Results and Discussion accordingly.

      Figure 3B Normality of residuals: This data set does not pass Shapiro-Wilk test.

      A major issue of the work is also that it is unnecessarily complicated. In the Results section, the authors introduce a number of beta cell subpopulations: first responder cell, last responder cell, wave origin cell, wave end cell, hub-like phase 1, hub-like phase 2, and random cells, which are all defined in exclusively relative terms, regarding the time within which the cells responded, phase lags of their oscillations, or mutual distances within the islet. These cell types also partially overlap.

      To address this comment, we added Table 1 to describe the properties of these different populations.

      Their choice to use the diameter percentile as a metric for distances between the cells is not well substantiated, since they do not demonstrate how islet size variability would influence the conclusion. All presented islets are of rather comparable size, within the diffusion limits.

      We replaced normalized distances in Fig.1 D with absolute distance from first responder in μm.

      The functional hierarchy of cells defining the first response should be reflected in the consistency of their relative response times. The authors claim that the spatial organisation is consistent over up to 24 hours. In the first place, it is not clear why this prolonged consistency would be an advantage compared with the absence of such consistency. The linear regression analysis between the initial and repeated relative activation times does suggest a significant correlation, but the distribution of regression residuals of the provided data is again not normal, and the analysis is inconclusive despite the low p-value. 50% of the cells defined as first responders in the initial stimulation were part of that subpopulation also during the second stimulation, which is rather random.

      We began describing our analysis of the response time to initial and repeated glucose stimulation earlier in this reply. Further evidence for the distance dependence of response-time consistency is now presented in Fig. S4A, which shows response-time consistency for cells within 60 μm, 50 μm, and 40 μm of the first responder. The closer a cell is to the first responder, the higher the consistency of its response time (the lower the scatter), below.

      If we analyze these data with a linear regression model, where r^2 allows us to quantify the decrease in scatter, we observe r^2 values of 0.3013, 0.3228, and 0.3674 for cells within 60 μm, 50 μm, and 40 μm of the first responder, respectively (below). These data are not included in the manuscript because the residuals do not pass the Shapiro-Wilk normality test for this model (while they do for the LMEM).

      One of the most surprising features of this study is the total lack of fast [Ca2+] oscillations, which in mouse islets stimulated with 11 mM glucose are typically several seconds long and should be easily detected at the measurement speed used.

      The data used in this manuscript contain Ca2+ dynamics from islets with (a) slow oscillations only, (b) fast oscillations superimposed on the slow oscillations, or (c) no obvious oscillations (likely continual spiking). Representative curves are below. Because we focused our study on the slow oscillations, we used dynamics of type (a) in our figures, which gave the impression that no fast oscillations were present. In our analysis of type (b) dynamics, we used a Fourier transform to separate the slow oscillations from the fast ones (described in Methods). Type (c) dynamics were excluded from the analysis of the oscillatory phase and used only for the first-phase analysis. We indicate this exclusion in the Methods.
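The separation of slow from fast oscillations in type (b) traces can be illustrated with a Fourier low-pass filter. This sketch assumes the general approach only; it is not the filter described in the Methods. The cutoff, sampling interval and function name are illustrative. A naive O(n^2) DFT keeps the example dependency-free; numpy.fft would normally be used instead. Components above the cutoff frequency are zeroed and the trace is transformed back. Subtracting the result from the original trace gives the fast component.

```python
import cmath
import math


def fourier_lowpass(signal, dt, cutoff_hz):
    """Keep only Fourier components at or below cutoff_hz (naive O(n^2) DFT).

    signal: evenly sampled trace; dt: sampling interval in seconds.
    """
    n = len(signal)
    spectrum = [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    for k in range(n):
        # bins above n/2 hold the negative frequencies of a real signal
        freq_hz = min(k, n - k) / (n * dt)
        if freq_hz > cutoff_hz:
            spectrum[k] = 0
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]
```

With 1 s sampling over 240 s, a cutoff of 0.05 Hz (20 s period) keeps a 120 s slow oscillation and removes a superimposed 5 s fast one.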

      And lastly, we should not perpetuate imprecise information about the disease if we know better. The first sentence of the Introduction, stating that “Diabetes is a disease characterised by high blood glucose, …”, is not precise. “Diabetes” alone describes only polyuria. Regarding the role of high glucose, to quote from a textbook by K. Frayn and R. Evans, Human Metabolism: A Regulatory Perspective, 4th ed., 2019: “The changes in glucose metabolism are usually regarded as the "hallmark" of diabetes mellitus, and treatment is always monitored by the level of glucose in the blood. However, it has been said that if it were as easy to measure fatty acids in the blood as it is to measure glucose, we would think of diabetes mellitus mainly as a disorder of fat metabolism.”

      We acknowledge that “diabetes” alone refers to polyuria, and we now state Diabetes Mellitus to be more precise about the disease we refer to. We stated “Diabetes is a disease characterized by high blood glucose, ...” as this is in line with internationally accepted diagnosis and classification criteria, such as position statements from the American Diabetes Association [“Diagnosis and Classification of Diabetes Mellitus”, American Diabetes Association, Diabetes Care, 36 (2013)]. We certainly acknowledge that the glucose-centric approach to characterizing and diagnosing Diabetes Mellitus largely stems from the ease with which glucose can be measured. Thus, if blood lipids could be measured as easily, we might characterize diabetes as a disease of hyperlipidemia (depending on how lipidemia links with the complications of diabetes).

    1. Author Response:

      Joint Public Review:

      A highly robust result when investigating how neural population activity is impacted by performance in a task is that trial-to-trial correlations (noise correlations) between neurons are reduced as performance increases. However, the theoretical and experimental literature has so far failed to account for this robust link, since reduced noise correlations do not systematically contribute to improved availability or transmission of information (often measured using decoding of stimulus identity). This paper sets out to address this discrepancy by proposing that the key to linking noise correlations to decoding, and thus bridging the gap with performance, is to rethink the decoders we use: instead of decoders optimized for the specific task imposed on the animal on any given trial (A vs B / B vs C / A vs C), they hypothesize that we should favor a decoder optimized for a general readout of stimulus properties (A vs B vs C).

      To test this hypothesis, the authors use a combination of quantitative data analysis and mechanistic network modeling. Data were recorded from neuronal populations in area V4 of two monkeys trained to perform an orientation change detection task, where the magnitude of orientation change could vary across trials, and the change could happen at cued (attended) or uncued (unattended) locations in the visual field. The model, which extends previous work by the authors, reproduces many basic features of the data, and both the model and data offer support for the hypothesis.

      The reviewers agreed that this is a potentially important contribution, that addresses a widely observed, but puzzling, relation between perceptual performance and noise correlations. The clarity of the hypothesis, and the combination of data analysis and computational modelling are two essential strengths of the paper.

      Overall, this paper exhibits a new factor to be taken into account when analysing neural data: the choice of decoder, and in particular how general or specific the decoder is. The fact that the generality of the decoder sheds light on the much-debated question of noise correlations underscores its importance. The paper therefore opens multiple avenues for future research to probe this new idea, in particular for tasks with multiple stimulus dimensions.

      Nonetheless, as detailed below, the reviewers believe the clarity of the manuscript could be further improved on several points, and some additional analysis of the data would provide a more straightforward test of the hypothesis.

      1. It would be important to verify that the model reproduces the correlation between noise and signal correlations, since this is a key argument leading to the authors' hypothesis.

      We have incorporated this verification of the model into the manuscript, as referred to below in the Results:

      “Importantly, this model reproduces the correlation between noise and signal correlations (Figure 2–figure supplement 1) observed in electrophysiological data (Cohen & Maunsell, 2009; Cohen & Kohn, 2011). This correlation between the shared noise and the shared tuning is a key component of the general decoder hypothesis. We observed this strong relationship between noise and signal correlations in our recorded neurons (Figure 2–figure supplement 1A) as well as in our modeled data (Figure 2–figure supplement 1B). Using this model, we were able to measure the relationship between noise and signal correlations for varying strengths of attentional modulation. Consistent with the predictions of the general decoder hypothesis, attention weakened the relationship between noise and signal correlations (Figure 2–figure supplement 1C).”

      The new figure is as below:

      Figure 2–figure supplement 1. The model reproduces the relationship between noise and signal correlations that is key to the general decoder hypothesis. (A) As previously observed in electrophysiological data (Cohen & Maunsell, 2009; Cohen & Kohn, 2011), we observe a strong relationship between noise and signal correlations. During additional recordings collected during most recording sessions (for Monkey 1 illustrated here, n = 37 days with additional recordings), the monkey was rewarded for passively fixating the center of the monitor while Gabors with randomly interleaved orientations were flashed at the receptive field location (‘Stim 2’ location in Figure 1C). The presented orientations spanned the full range of stimulus orientations (12 equally spaced orientations from 0 to 330 degrees). We calculated the signal correlation for each pair of units based on their mean responses to each of the 12 orientations. We define the noise correlation for each pair of units as the average noise correlation for each orientation. The plot depicts signal correlation as a function of noise correlation across all recording sessions, binned into 8 equally sized sets of unit pairs. Error bars represent SEM. (B) The model reproduces the relationship between noise and signal correlations. Signal correlation is plotted as a function of noise correlation, binned into 20 equally sized sets of unit pairs (n = 2000 neurons), for each attentional modulation strength (green: least attended; yellow: most attended). The results were averaged over 50 tested orientations. (C) The slope of the relationship between noise and signal correlations (y-axis) decreases with increasing attentional modulation (x-axis). This suggests that noise is less aligned with signal correlation with increasing attentional modulation.
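The signal and noise correlation definitions in this legend can be written out directly. The following is a minimal sketch, assuming each unit's responses are stored as `resp[orientation][trial]` with the same trials for both units (the layout and names are hypothetical). The signal correlation correlates the two units' mean responses across orientations. The noise correlation averages, over orientations, the trial-by-trial correlation of the two units' responses.

```python
def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def signal_correlation(resp_a, resp_b):
    """Correlate the two units' tuning curves (trial-averaged responses per orientation)."""
    mean_a = [sum(trials) / len(trials) for trials in resp_a]
    mean_b = [sum(trials) / len(trials) for trials in resp_b]
    return pearson(mean_a, mean_b)


def noise_correlation(resp_a, resp_b):
    """Average over orientations of the trial-by-trial correlation of the two units."""
    per_orientation = [pearson(ta, tb) for ta, tb in zip(resp_a, resp_b)]
    return sum(per_orientation) / len(per_orientation)
```

Binning unit pairs by noise correlation and averaging their signal correlations within each bin would then give curves of the kind plotted in panels A and B.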

      2. Testing the hypothesis of the general decoder:

      2.1 In the data, the authors compare mainly the specific (stimulus) decoder and the monkey's choice decoder. The general stimulus decoder is only considered in fig. 3f, because data across multiple orientations are available only for the cued condition, and therefore the general and specific decoders cannot be compared for changes between cued and uncued. However, the hypothesized relation between mean correlations and performance should also be true within a fixed attention condition (cued), comparing sessions with larger vs. smaller correlation. In other words, if the hypothesis is correct, you should find that performance of the "most general" decoder (as in fig. 3f) correlates negatively with average noise correlations, across sessions, more so than the "most specific" decoder.

      We have added a new supplementary figure to the manuscript:

      Figure 3–figure supplement 1. Based on the electrophysiological data, the performance of the monkey’s decoder was more related to mean correlated variability than the performance of the specific decoder within each attention condition. (A) Within the cued attention condition, the performance of the monkey’s decoder was more related to mean correlated variability (left plot; correlation coefficient: n = 71 days, r = -0.23, p = 0.058) than the performance of the specific decoder (right plot; correlation coefficient: r = 0.038, p = 0.75). The correlation coefficients associated with the two decoders were significantly different from each other (Williams’ procedure: t = 3.8, p = 1.5 x 10^-4). Best fit lines plotted in gray. Data from both monkeys combined (Monkey 1 data shown in orange: n = 44 days; Monkey 2 data shown in purple: n = 27 days) with mean correlated variability z-scored within monkey. (B) The data within the uncued attention condition showed a similar pattern, with the performance of the monkey’s decoder more related to mean correlated variability (n = 69 days, r = -0.20, p = 0.14) than the performance of the specific decoder (r = 0.085, p = 0.51; Williams’ procedure: t = 2.0, p = 0.049). Conventions as in (A) (Monkey 1: n = 42 days – see Methods for data exclusions as in Figure 3C; Monkey 2: n = 27 days).
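Williams' procedure compares two correlations that share a variable and are measured on the same days; here those are correlated variability vs the monkey's decoder and correlated variability vs the specific decoder. The sketch below implements Williams' t in the form given by Steiger (1980), with n - 3 degrees of freedom. It is an illustration rather than the authors' code. It needs the correlation between the two decoders' performances (r23), which the legend does not report, so any value supplied for it here is a placeholder. A p-value would come from the t distribution, for example via scipy.stats.t.sf.

```python
import math


def williams_t(r12, r13, r23, n):
    """Williams' t for H0: rho12 == rho13, where variables 2 and 3 are both
    correlated with variable 1 over the same n observations.

    Returns (t, degrees of freedom).
    """
    # determinant of the 3x3 correlation matrix
    det_r = 1 - r12 ** 2 - r13 ** 2 - r23 ** 2 + 2 * r12 * r13 * r23
    r_bar = (r12 + r13) / 2
    t = (r12 - r13) * math.sqrt(
        (n - 1) * (1 + r23)
        / (2 * (n - 1) / (n - 3) * det_r + r_bar ** 2 * (1 - r23) ** 3)
    )
    return t, n - 3
```

The sign of t only reflects which correlation is listed first; swapping r12 and r13 flips it.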

      2.2 In figure 3f, a more straightforward and precise comparison is to use the stimulus decoders to predict the choice, and test whether the more specific or the more general can predict choices more accurately.

      We have added a new panel to Figure 3 (Figure 3G) that illustrates the results of this analysis comparing whether the specific or more-general decoders predict the monkey’s trial-by-trial choices more accurately:

      Figure 3… (G) The more general the decoder (x-axis), the better its performance predicting the monkey’s choices on the median changed orientation trials (y-axis; the proportion of leave-one-out trials in which the decoder correctly predicted the monkey’s decision as to whether the orientation was the starting orientation or the median changed orientation). Conventions as in (F) (see Methods for n values).

      The description of this new panel in the Results section is as below:

      “Further, the more general the decoder, the better it predicted the monkey’s trial-by-trial choices on the median changed orientation trials (Figure 3G).”

      The updated Methods section describing this new panel is as below:

      “For Figure 3G, we performed analyses similar to those performed for Figure 3F, in that we tested each stimulus decoder: ‘1 ori’ decoders (n = 8 decoders; 1 specific decoder for either the first, second, fourth, or fifth largest changed orientation, for each of the 2 monkeys), ‘2 oris’ decoders (n = 12 decoders; 1 decoder for each of the 6 combinations of 2 changed orientations, for each of the 2 monkeys), ‘3 oris’ decoders (n = 8 decoders; 1 decoder for each of the 4 combinations of 3 changed orientations, for each of the 2 monkeys), and ‘4 oris’ decoders (n = 2 decoders; 1 decoder for the 1 combination of 4 changed orientations, for each of the 2 monkeys). However, unlike in Figure 3F, where the performance of the stimulus decoders was compared to the performance of the monkey’s decoder on the median orientation-change trials, here we calculated the performance of the stimulus decoder when tasked with predicting the trial-by-trial choices that the monkey made on the median orientation-change trials. We plotted the proportion of leave-one-out trials in which each decoder correctly predicted the monkey’s choice as to whether the orientation was the starting orientation or the median changed orientation.”
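The leave-one-out choice-prediction analysis can be sketched as below. The decoder here is a hypothetical difference-of-class-means linear readout, chosen to keep the example short; the decoders actually used in the study may differ. Each trial is held out in turn, and the decoder is trained on the stimulus labels (0 = starting, 1 = changed orientation) of the remaining trials. Its prediction on the held-out trial is then scored against the monkey's choice rather than against the stimulus. A "more general" decoder would differ only in which changed orientations are pooled into class 1 of the training set.

```python
def class_mean(rows):
    """Element-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def fit_mean_difference_decoder(X, labels):
    """Hypothetical linear decoder: weights = mean(class 1) - mean(class 0),
    with the decision threshold at the projection of the midpoint."""
    m0 = class_mean([x for x, y in zip(X, labels) if y == 0])
    m1 = class_mean([x for x, y in zip(X, labels) if y == 1])
    w = [b - a for a, b in zip(m0, m1)]
    midpoint = [(a + b) / 2 for a, b in zip(m0, m1)]
    threshold = sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, threshold


def loo_choice_prediction(X, stim_labels, choices):
    """Fraction of held-out trials on which the stimulus-trained decoder
    predicts the monkey's choice (both coded 0/1)."""
    n_correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = stim_labels[:i] + stim_labels[i + 1:]
        w, threshold = fit_mean_difference_decoder(train_X, train_y)
        predicted = 1 if sum(wi * xi for wi, xi in zip(w, X[i])) > threshold else 0
        n_correct += predicted == choices[i]
    return n_correct / len(X)
```

When the monkey's choice disagrees with the stimulus on a trial, a decoder trained purely on stimulus labels will usually miss that trial. This is what makes the score a measure of how closely a decoder matches the monkey's readout.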

      3. The main goal of the manuscript is to determine the impact of noise correlations on various decoding schemes. The figures however only show how decoding co-varies with correlations, but a direct, more causal analysis of the effect of correlations on decoding seems to be missing. Such an analysis can be obtained by comparing decoding on simultaneously recorded activity with decoding on trial-shuffled activity, in which noise-correlations are removed.

      We have added the following Discussion section to address this point:

      “The purpose of this study was to investigate the relationship between mean correlated variability and a general decoder. We made an initial test of the overarching hypothesis that observers use a general decoding strategy in feature-rich environments by testing whether a decoder optimized for a broader range of stimulus values better matched the decoder actually used by the monkeys than a specific decoder optimized for a narrower range of stimulus values. We purposefully did not make claims about the utility of correlated variability relative to hypothetical situations in which correlated variability does not exist in the responses of a group of neurons, as we suspect that this is not a physiologically realistic condition. Studies that causally manipulate the level of correlated variability in neuronal populations to measure the true physiological and behavioral effects of increasing or decreasing correlated variability levels, through pharmacological or genetic means, may provide important insights into the impact of correlated variability on various decoding strategies.”
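For reference, the trial-shuffle control suggested in point 3 can be sketched as follows. This is a hypothetical helper, not an analysis in the manuscript. Within each stimulus condition, each neuron's trials are permuted independently. This preserves every neuron's response distribution, and therefore its tuning, while removing trial-by-trial co-fluctuations, i.e. noise correlations. Decoders would then be compared on original vs shuffled data.

```python
import random


def shuffle_within_condition(responses, seed=0):
    """responses[c][t][i]: response of neuron i on trial t of condition c.

    Independently permutes each neuron's trials within a condition. Each
    neuron's set of responses per condition is unchanged, but the pairing of
    responses across neurons on the same trial is broken. The input is not
    modified.
    """
    rng = random.Random(seed)
    shuffled = []
    for trials in responses:
        n_trials, n_neurons = len(trials), len(trials[0])
        columns = []
        for i in range(n_neurons):
            col = [trials[t][i] for t in range(n_trials)]
            rng.shuffle(col)
            columns.append(col)
        shuffled.append([[columns[i][t] for i in range(n_neurons)] for t in range(n_trials)])
    return shuffled
```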

      4. How different are the four different decoders (specific/monkey, cued/uncued)? It would be interesting to see how much they overlap. More generally, the authors should discuss the alternative that attention modulates also the readout/decoding weights, rather than or in addition to modulating V4 activity.

      We have added the following to the manuscript:

      A fixed readout mechanism

      A prior study from our lab found that attention, rather than changing the neuronal weights of the observer’s decoder, reshaped neuronal population activity to better align with a fixed readout mechanism (Ruff & Cohen, 2019). To test whether the neuronal weights of the monkey’s decoder changed across attention conditions (attended versus unattended), Ruff and Cohen switched the neuronal weights across conditions, testing the stimulus information in one attention condition with the neuronal weights from the other. They found that even with the switched weights, the performance of the monkey’s decoder was still higher in the attended condition. The results of this study support the conclusion that attention reshapes neuronal activity so that a fixed readout mechanism can better read out stimulus information. In other words, differences in the performance of the monkey’s decoder across attention conditions may be due to differences in how well the neuronal activity aligns with a fixed decoder.

      Our study extends the findings of Ruff and Cohen to test whether that fixed readout mechanism is determined by a general decoding strategy. Our findings support the hypothesis that observers use a general decoding strategy in the face of changing stimulus and task conditions. Our findings do not exclude other potential explanations for the suboptimality of the monkey’s decoder, nor do they exclude the possibility that attention modulates decoder neuronal weights. However, our findings together with those of Ruff and Cohen shed light on why neuronal decoders are suboptimal in a manner that aligns the fixed decoder axis with the correlated variability axis (Ni et al., 2018; Ruff et al., 2018).”

      5. Quantifying the link between model and data:

      5.1 The text providing motivation for the model could be improved. The motivation used in the manuscript is, essentially, that the model allows one to extrapolate beyond the data (more stimuli, more repetitions, more neurons). The dangers of extrapolating beyond the range of the data are, however, well known. A model that extrapolates beyond existing data is useful for designing new experiments and testing predictions, but this is not done here. Because the manuscript is about information and decoding, a better motivation is the fact that this model takes an actual image as input and produces tuning and covariance that are compatible with each other, because they are constrained by an actual network that processes the input (as opposed to parametric models, where tuning and covariance can be manipulated independently).

      We have modified the manuscript as below:

      “Here, we describe a circuit model that we designed to allow us to compare the specific and monkey’s decoders from our electrophysiological dataset to modeled ideal specific and general decoders. The primary benefit of our model is that it can take actual images as inputs and produce neuronal tuning and covariance that are compatible with each other because of constraints from the simulated network that processed the inputs (Huang et al., 2019). Parametric models in which tuning and covariance can be manipulated independently would not provide such constraints. In our model, the mean correlated variability of the population activity is restricted to very few dimensions, matching experimentally recorded data from visual cortex demonstrating that mean correlated variability occupies a low-dimensional subset of the full neuronal population space (Ecker et al., 2014; Goris et al., 2014; Huang et al., 2019; Kanashiro et al., 2017; Lin et al., 2015; Rabinowitz et al., 2015; Semedo et al., 2019; Williamson et al., 2016).”

      “Our study also demonstrates the utility of combining electrophysiological and circuit modeling approaches to studying neural coding. Our model mimicked the correlated variability and effects of attention in our physiological data. Critically, our model produced neuronal tuning and covariance based on the constraints of an actual network capable of processing images as inputs.”

      We have also removed the Results and Discussion text that suggested that the model allowed us to extrapolate beyond the data.

      5.2 The ring structure, and the orientation of correlations (Fig 2b) seem to be key ingredients of the model, but are they based on data, or ad-hoc assumptions?

      We have modified the manuscript to clarify this point, as below:

      “As the basis for our modeled general decoder, we first mapped the n-dimensional neuronal activity of our model in response to the full range of orientations to a 2-dimensional space. Because the neurons were tuned for orientation, we could map the n-dimensional population responses to a ring (Figure 2B, C). The orientation of correlations (the shape of each color cloud in Figure 2B) was not an assumed parameter, and illustrates the outcome of the correlation structure and dimensionality of our modeled data. In Figure 2B, we can see that the fluctuations along the radial directions are much larger than those along other directions for a given orientation. This is consistent with the low-dimensional structure of the modeled neuronal activity. In our model, the fluctuations of the neurons, mapped to the radial direction on the ring, were more elongated in the unattended state (Figure 2B) than in the attended state (Figure 2C).”

      5.3 In the model, the specific decoder is quite strongly linked to correlated variability and the improvement of the general decoder is clear but incremental (0.66 vs 0.83), whereas in the data there is really no correlation at all (Fig 3c). This is a bit problematic because the authors begin by stating that specific decoders cannot explain the link between noise correlations and accuracy, but their specific decoder clearly shows a link.

      We appreciate this point and have modified the manuscript as below:

      “Indeed, we found that just as the performance of the physiological monkey’s decoder was more strongly related to mean correlated variability than the performance of the physiological specific decoder (Figure 3C; see Figure 3–figure supplement 1 for analyses per attention condition), the performance of the modeled general decoder was more strongly related to mean correlated variability than the performance of the modeled specific decoder (Figure 3D). We modeled much stronger relationships to correlated variability (Figure 3D) than observed with our physiological data (Figure 3C). We observed that the correlation with specific decoder performance was significant with the modeled data but not with the physiological data. This is not surprising as we saw attentional effects, albeit small ones, on specific decoder performance with both the physiological and the modeled data (Figure 3A, B). Even small attentional effects would result in a correlation between decoder performance and mean correlated variability with a large enough range of mean correlated variability values. It is possible that with enough electrophysiological data, the performance of the specific decoder would be significantly related to correlated variability, as well. As described above, our focus is not on whether the performance of any one decoder is significantly correlated with mean correlated variability, but on which decoder provides a better explanation of the frequently observed relationship between performance and mean correlated variability. The performance of the general decoder was more strongly related to mean correlated variability than the performance of the specific decoder.”

      “Our results suggest that the relationship between behavior and mean correlated variability is more consistent with observers using a more general strategy that employs the same neuronal weights for decoding any stimulus change.”

      6. General decoder: Some parts of the text (e.g., Line 60, Line 413) refer to a decoder that accounts for discrimination along different stimulus dimensions (e.g., different values of orientation, or different colors of the visual input). But the results of the manuscript are about a general decoder for multiple values along a single stimulus dimension. The disconnect should be discussed, and the relation between these two scenarios explained.

      We have modified the manuscript as below:

      “Here, we report the results of an initial test of this overarching hypothesis, based on a single stimulus dimension. We used a simple, well-studied behavioral task to test whether a more-general decoder (optimized for a broader range of stimulus values along a single dimension) better explained the relationship between behavior and mean correlated variability than a more-specific decoder (optimized for a narrower range of stimulus values along a single dimension). Specifically, we used a well-studied orientation change-detection task (Cohen & Maunsell, 2009) to test whether a general decoder for the full range of stimulus orientations better explained the relationship between behavior and mean correlated variability than a specific decoder for the orientation change presented in the behavioral trial at hand.

      This test based on a single stimulus dimension is an important initial test of the general decoder hypothesis because many of the studies that found that performance increased when mean correlated variability decreased used a change-detection task…”

      “We performed this initial test of the overarching general decoder hypothesis in the context of a change-detection task along a single stimulus dimension because this type of task was used in many of the studies that reported a relationship between perceptual performance and mean correlated variability (Cohen & Maunsell, 2009; 2011; Herrero et al., 2013; Luo & Maunsell, 2015; Mayo & Maunsell, 2016; Nandy et al., 2017; Ni et al., 2018; Ruff & Cohen, 2016; 2019; Verhoef & Maunsell, 2017; Yan et al., 2014; Zénon & Krauzlis, 2012). This simple and well-studied task provided an ideal initial test of our general decoder hypothesis.

      This initial test of the general decoder hypothesis suggests that a more general decoding strategy may explain observations in studies that use a variety of behavioral and stimulus conditions.”

      “This initial study of the general decoder hypothesis tested this idea in the context of a visual environment in which stimulus values only changed along a single dimension. However, our overarching hypothesis is that observers use a general decoding strategy in the complex and feature-rich visual scenes encountered in natural environments. In everyday environments, visual stimuli can change rapidly and unpredictably along many stimulus dimensions. The hypothesis that such a truly general decoder explains the relationship between perceptual performance and mean correlated variability is suggested by our finding that the modeled general decoder for orientation was more strongly related to mean correlated variability than the modeled specific decoder (Figure 3D). Future tests of a general decoder for multiple stimulus features would be needed to determine if this decoding strategy is used in the face of multiple changing stimulus features. Further, such tests would need to consider alternative hypotheses for how sensory information is decoded when observing multiple aspects of a stimulus (Berkes et al., 2009; Deneve, 2012; Lorteije et al., 2015). Studies that use complex or naturalistic visual stimuli may be ideal for further investigations of this hypothesis.”

      7. Some statements in the Discussion, such as line 354, "the relationship between behavior and mean correlated variability is explained by the hypothesis that observers use a general strategy", should be qualified: the authors clearly show that the general decoder amplifies the relationship, but in their own data the relationship already exists with a specific decoder.

      We have modified the manuscript as below:

      “Our results suggest that the relationship between behavior and mean correlated variability is more consistent with observers using a more general strategy that employs the same neuronal weights for decoding any stimulus change.”

      “Together, these results support the hypothesis that observers use a more general decoding strategy in scenarios that require flexibility to changing stimulus conditions.”

      “This initial test of the general decoder hypothesis suggests that a more general decoding strategy may explain observations in studies that use a variety of behavioral and stimulus conditions.”

      8. Low-Dimensionality, beginning of Introduction and end of Discussion: experimentally, cortical activity is low-dimensional, and the proposed model captures that. But some of the reviewers did not understand the argument offered for why this matters for the relation between average correlations and performance. It seems that the dimensionality of the population covariance is not relevant: the point instead is that a change in amplitude of fluctuations along the f'f' direction necessarily impacts performance of a "specific" decoder, whereas changes in all other dimensions can be accounted for by the appropriate weights of the "specific" decoder. On the other hand, changes in fluctuation strength along multiple directions may impact the performance of the "general" decoder.

      We have modified the manuscript as below:

      “These observations comprise a paradox because changes in this simple measure should have a minimal effect on information coding. Recent theoretical work shows that neuronal population decoders that extract the maximum amount of sensory information for the specific task at hand can easily ignore mean correlated noise (Kafashan et al., 2021; Kanitscheider et al., 2015b; Moreno-Bote et al., 2014; Pitkow et al., 2015; Rumyantsev et al., 2020; for review, see Kohn et al., 2016). Decoders for the specific task at hand can ignore mean correlated variability because it does not corrupt the dimensions of neuronal population space that are most informative about the stimulus (Moreno-Bote et al., 2014).”

      “Our results address a paradox in the literature. Electrophysiological and theoretical evidence supports that there is a relationship between mean correlated variability and perceptual performance (Abbott & Dayan, 1999; Clery et al., 2017; Haefner et al., 2013; Jin et al., 2019; Ni et al., 2018; Ruff & Cohen, 2019; reviewed by Ruff et al., 2018). Yet, a specific decoding strategy in which different sets of neuronal weights are used to decode different stimulus changes cannot easily explain this relationship (Kafashan et al., 2021; Kanitscheider et al., 2015b; Moreno-Bote et al., 2014; Pitkow et al., 2015; Rumyantsev et al., 2020; reviewed by Kohn et al., 2016). This is because specific decoders of neuronal population activity can easily ignore changes in mean correlated noise (Moreno-Bote et al., 2014).”
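      The argument that a specific decoder can ignore mean correlated noise, while a fixed decoder cannot, can be illustrated with a toy two-neuron sketch. All quantities are hypothetical and chosen for illustration: the stimulus change moves the two neurons in opposite directions (f' = (1, -1)), mean correlated noise lies along the shared (1, 1) axis with strength eps, the "specific" decoder uses the optimal linear weights Σ^{-1} f' for this one change, and the "general" decoder is stood in for by fixed weights (1, 0) that are not tuned to this change. This is not the modeled general decoder from the manuscript.

```python
# Toy sketch: optimal ("specific") vs fixed ("general") linear readouts under
# shared noise. All numbers are hypothetical.


def quad(w, cov):
    """Return w^T C w for a 2x2 covariance matrix C."""
    return sum(w[i] * cov[i][j] * w[j] for i in range(2) for j in range(2))


def dprime_sq(w, fprime, cov):
    """Squared discriminability of the linear readout w: (w . f')^2 / (w^T C w)."""
    signal = sum(wi * fi for wi, fi in zip(w, fprime))
    return signal ** 2 / quad(w, cov)


def inv2(cov):
    """Inverse of a 2x2 matrix given as a list of lists."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def shared_noise_cov(eps):
    """Unit private noise plus variance eps along the shared (1, 1)/sqrt(2) axis."""
    return [[1 + eps / 2, eps / 2], [eps / 2, 1 + eps / 2]]


FPRIME = (1.0, -1.0)    # hypothetical signal direction for one stimulus change
W_GENERAL = (1.0, 0.0)  # hypothetical fixed weights, not tuned to this change


def compare(eps):
    """Return (specific d'^2, general d'^2) at shared-noise strength eps."""
    cov = shared_noise_cov(eps)
    inv = inv2(cov)
    # Specific decoder: optimal linear weights Sigma^{-1} f' for this change.
    w_specific = [sum(inv[i][j] * FPRIME[j] for j in range(2)) for i in range(2)]
    return dprime_sq(w_specific, FPRIME, cov), dprime_sq(W_GENERAL, FPRIME, cov)


for eps in (0.0, 1.0, 4.0):
    specific, general = compare(eps)
    print(f"shared noise {eps:.1f}: specific d'^2 = {specific:.3f}, general d'^2 = {general:.3f}")
```

      In this toy, the specific decoder's d'^2 stays at 2 for every noise level because the shared-noise axis is orthogonal to f', whereas the fixed readout's d'^2 falls as 1/(1 + eps/2). Noise added along f' itself would instead reduce the performance of both readouts, which is the reviewer's point about the f'f' direction.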

    1. Author Response:

      Reviewer #1 (Public Review):

      The introduction felt a bit short. Early on, I was hoping for a hint at which biotic and abiotic factors UV could be important for, and how this might be important for adaptation. A bit more on previous work on the genetics of UV pigmentation could be added too. I think a bit more on sunflowers more generally (what petiolaris is, where natural populations are distributed, etc.) would be helpful. This seems more relevant than its status as an emoji, for example.

      We had opted to provide some of the relevant background in the corresponding sections of the manuscript, but agree that it would be beneficial to expand the introduction. In the revised version of the manuscript, we have modified the introduction and the first section of Results and Discussion to include more information about wild sunflowers, possible adaptive functions of floral UV patterns, and previous work on the genetic basis of floral UV patterning. More generally, we have strived to provide more background information throughout the manuscript.

      The authors present the % of Vp explained by the Chr15 SNP. Perhaps I missed it, but it might be nice to also present the narrow sense heritability and how much of Va is explained.

      Narrow-sense heritability for LUVp is extremely high in our H. annuus GWAS population; four different software packages [EMMAX (Kang et al., Nat Genet 2010), GEMMA (Zhou and Stephens, Nat Genet. 2012), GCTA (Yang et al., Am J Hum Genet 2011) and BOLT_LMM (Loh et al., Nat Genet 2015)] provided h2 estimates of ~1. While it is possible that these estimates are somewhat inflated by the presence of a single locus of extremely large effect, all individuals in this population were grown at the same time under the same conditions, and limited environmental effects would therefore be expected. The percentage of additive variance explained by HaMYB111 therefore appears to be equal to the percentage of phenotypic variance (~62%).
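      The step from phenotypic to additive variance in the paragraph above can be written out explicitly. In this sketch, V_P denotes phenotypic variance, V_A additive variance, V_SNP the variance attributable to the Chr15_LUVp SNP, and h^2 = V_A / V_P the narrow-sense heritability; these symbols are introduced here for illustration and assume that the SNP's contribution is additive.

```latex
\frac{V_{\mathrm{SNP}}}{V_A}
  = \frac{V_{\mathrm{SNP}} / V_P}{V_A / V_P}
  = \frac{0.62}{h^2}
  \approx 0.62 \quad \text{when } h^2 \approx 1
```

      If h^2 were inflated and the true value were, for example, 0.8, the same SNP would account for 0.62 / 0.8 ≈ 0.78 of the additive variance; since h^2 cannot exceed 1, ~62% is a lower bound on its share of V_A under this assumption.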

      We have included details in the Methods section – Genome-wide association mapping, and added this information to the relevant section of the main text:

      “The chromosome 15 SNP with the strongest association with ligule UV pigmentation patterns in H. annuus (henceforth “Chr15_LUVp SNP”) explained 62% of the observed phenotypic and additive variation (narrow-sense heritability for LUVp in this dataset is ~1).”

      A few lines of discussion about why the Chr15 allele might be observed at only low frequencies in petiolaris I think would be of interest - the authors appear to argue that the same abiotic factors may be at play in petiolaris, so why don't we see this allele at frequencies higher than 2%? Is it recent? Geographically localized?

      That is a very interesting observation, and we currently do not have enough data to provide a definitive answer to why that is. From GWAS, HaMYB111 does not seem to play a measurable role in controlling variation for LUVp in H. petiolaris: even when we repeat the GWAS with MAF > 1%, so that the Chr15_LUVp SNP would be included in the analysis, there is no significant association between that SNP and LUVp (the significant association on chr. 15 seen in the Manhattan plot for H. petiolaris is ~20 Mbp downstream of HaMYB111). The rarity of the L allele in H. petiolaris could complicate detection of a GWAS signal; on the other hand, the few H. petiolaris individuals carrying the L allele have, on average, only marginally larger LUVp than the rest of the population (LL = 0.32 allele).

      The two most likely explanations for the low frequencies of the L allele in H. petiolaris are differences in alleles, or their effect, between H. annuus and H. petiolaris; or, as suggested by the reviewer, a recent introgression. In H. annuus, the Chr15_LUVp SNP is likely not the actual causal polymorphism affecting HaMYB111 activity, but is only in LD with it (or them); this association might be absent in H. petiolaris alleles. An alternative possibility is that downstream differences in the genetic network regulating flavonol glycoside biosynthesis mask the effect of different HaMYB111 alleles.

      H. annuus and H. petiolaris hybridize frequently across their range, so this could be a recent introgression that has not established itself; alternatively, physiological differences in H. petiolaris could make the L allele less advantageous, so the introgressed allele is simply being maintained by drift (or recurring hybridization). Further analysis of genetic and functional diversity at HaMYB111 in H. petiolaris will be required to differentiate between these possibilities.

      We have added a few sentences highlighting some of these possible explanations at the end of the main text of the manuscript, which now reads:

      “Despite a more limited range of variation for LUVp, a similar trend (larger UV patterns in drier, colder environments) is also present in H. petiolaris (Figure 4 – figure supplement 4). Interestingly, while the L allele at the Chr15_LUVp SNP is present in H. petiolaris (Figure 1 – figure supplement 2), it is found only at a very low frequency, and does not seem to significantly affect floral UV patterns in this species (Figure 2a). This could represent a recent introgression, since H. annuus and H. petiolaris are known to hybridize in nature (Heiser, 1947, Yatabe et al., 2007). Alternatively, the Chr15_LUVp SNP might not be associated with functional differences in HaMYB111 in H. petiolaris, or differences in genetic networks or physiology between H. annuus and H. petiolaris could mask the effect of this allele, or limit its adaptive advantage, in the latter species.”

      Page 14: It's unclear to me why there is any need to discretize the LUVp values for the analyses presented here. Seems like it makes sense to either 1) analyze by genotype of plant at the Chr15 SNP, if known, or 2) treat it as a continuous variable and analyze accordingly.

      We designed our experiment to be a comparison between three well-defined phenotypic classes, to reduce the experimental noise inherent to pollinator visitation trials. As a consequence, intermediate phenotypic classes (0.3 < LUVp < 0.5 and 0.8 < LUVp < 0.95) are not represented in the experiment, and therefore we believe that analyzing LUVp as a continuous variable would be less appropriate in this case. In the revised manuscript, we have provided a modified Figure 4 – figure supplement 1 in which individual data points are shown (colour-coded by pollinator type), as well as fitted lines showing the general trend across the data.

      The individuals in pollinator visitation experiments were not genotyped for the Chr15_LUVp SNP; while having that information might provide a more direct link between HaMYB111 and pollinator visitation rates, our main interest in this experiment was to test the possible adaptive effects of variation in floral UV pigmentation.

      Page 14: I'm not sure you can infer selection from the % of plants grown in the experiment unless the experiment was a true random sample from a larger metapopulation that is homogenous for pollinator preference. In addition, I thought one of the Ashman papers had actually argued for intermediate level UV abundance in the presence of UV?

      We have removed mentions of selection from the sentence; while the 110 populations included in our 2019 common garden experiment were selected to represent the whole range of H. annuus, we agree that the pattern we observe is at best suggestive. We have, however, kept a modified version of the sentence in the revised manuscript, since we believe this is an interesting observation. The sentence now reads:

      “Pollination rates are known to be yield-limiting in sunflower (Greenleaf and Kremen, 2006), and a strong reduction in pollination could therefore have a negative effect on fitness; consistent with this, plants with very small LUVp values were rare (~1.5% of individuals) in our common garden experiment, which was designed to provide a balanced representation of the natural range of H. annuus.” (new lines 373-378)

      It is correct that Koski et al., Nature Plants 2015 found intermediate UV patterns to increase pollen viability in excised flowers of Argentina anserina exposed to artificial UV radiation. However, the authors also remark that larger UV patterns would probably be favoured in natural environments, in which UV radiation would be more than two times higher than in their experimental setting. Additionally, when using artificial flowers, they found that pollen viability increased linearly with the size of floral UV pattern.

      More generally, as we discuss later on in the manuscript, the pollen protection mechanism proposed in Koski et al., Nature Plants 2015 is unlikely to be as important in sunflower inflorescences, which are much flatter than the bowl-shaped flowers of A. anserina; consistent with this, and contrary to what was observed for A. anserina, we found no correlation between UV radiation and floral UV patterns in wild sunflowers (Figure 4c).

      I would reduce or remove the text around L316-321. If there's good a priori reason to believe flower heat isn't a big deal (L. 323) and the experimental data back that up, why add 5 lines talking up the hypothesis?

      We had fairly strong reasons to believe temperature might play an important role in floral UV pattern diversity: a link between flower temperature and UV patterns has been proposed before (Koski et al., Current Biol 2020); a very strong correlation exists between temperature and LUVp in our dataset; and, perhaps more importantly, inflorescence temperature is known to have a major effect on pollinator attraction (Atamian et al., Science 2016; Creux et al., New Phytol 2021). While it is known that UV radiation makes up only a small fraction of the total solar energy, we did not mean line 323 to imply that we were sure a priori that UV patterns would have no effect on inflorescence temperature.

      In the revised manuscript, we have re-organized that section and provided the information reported in line 323 (UV radiation accounts for only 3-7% of the total radiation at earth level) before the experimental results, to clarify what our thought process was in designing those experiments. The paragraph now reads:

      “By absorbing more radiation, larger UV bullseyes could therefore contribute to increasing the temperature of sunflower inflorescences, and their attractiveness to pollinators, in cold climates. However, UV wavelengths represent only a small fraction (3-7%) of the solar radiation reaching the Earth’s surface (compared to >50% for visible wavelengths), and might therefore not provide sufficient energy to significantly warm up the ligules (Nunez et al., 1994). In line with this observation, different levels of UV pigmentation had no effect on the temperature of inflorescences or individual ligules exposed to sunlight (Figure 4e-g; Figure 4 – figure supplement 3).”

      Page 17: The discussion of flower size is interesting. Is there any phenotypic or genetic correlation between LUVP and flower size?

      This is a really interesting question! There is no obvious genetic correlation between LUVp and flower size – in GWAS, HaMYB111 is not associated with any of the floral characteristics we measured (flowerhead diameter; disk diameter; ligule length; ligule width; relative ligule size; see Todesco et al., Nature 2020). There is also no significant association between ligule length and LUVp (R^2 = 0.0024, P = 0.1282), and only a very weak positive association between inflorescence size and LUVp (R^2 = 0.0243, P = 0.00013; see attached figure). There is, however, a stronger positive correlation between LUVp and disk size (the disk being the central part of the sunflower inflorescence, composed of the fertile florets; R^2 = 0.1478, P = 2.78 × 10^-21), and as a consequence a negative correlation between LUVp and relative ligule size (that is, the length of the ligule relative to the diameter of the whole inflorescence; R^2 = 0.1216, P = 1.46 × 10^-17). This means that, for inflorescences of the same size, plants with large LUVp values will tend to have smaller ligules and larger disks. Since the disk of sunflower inflorescences is uniformly UV-absorbing, this would further increase the size of the UV-absorbing region in these inflorescences.

      While it is tempting to speculate that this might be connected with regulation of transpiration (meaning that plants with larger LUVp further reduce transpiration from ligules by having smaller ligules; relative ligule size is also positively correlated with summer humidity, R^2 = 0.2536, P = 2.86 × 10^-5), there are many other fitness-related factors that could determine inflorescence size, and disk size in particular (seed size, floret/seed number...). Additionally, in common garden experiments, flowerhead size (and plant size in general) is affected by flowering time, which is also one of the reasons why we use LUVp to measure floral UV patterns instead of absolute measurements of bullseye size; in previous work from our group in Helianthus argophyllus, size measurements for inflorescence and UV bullseye mapped to the same locus as flowering time, while genetic regulation of LUVp was independent of flowering time (Moyers et al., Ann Bot 2017). Flowering time in H. annuus is known to be strongly affected by photoperiod (Blackman et al., Mol Ecol 2011), meaning that the flowering time we measured in Vancouver might not reflect the exact flowering time in the populations of origin of those plants – with consequences for inflorescence size.

      In summary, there is an interesting pattern of concordance between floral UV pattern and some aspects of inflorescence morphology, but we think it would be premature to draw any inference from them. Measurements of inflorescence parameters in natural populations would be much more informative in this respect.

      Reviewer #2 (Public Review):

      The genetic analysis is rigorously conducted with multiple Helianthus species and accessions of H. annuus. The same QTL was imputed in two Helianthus species, and fine-mapped to promoter regions of HaMyb111.

      While there is a significant association at the beginning of chr. 15 in the GWAS for H. petiolaris petiolaris, we should clarify that that peak is unfortunately ~20 Mbp away from HaMYB111. While it is not impossible that the difference is due to reference biases in mapping H. petiolaris reads to the cultivated H. annuus genome, the most conservative explanation is that those two QTL are unrelated. We have clarified this in the legend to Fig. 2 in the revised manuscript.

      The allelic variation of the TF was carefully mapped in many populations and accessions. Flavonol glycosides were found to correlate spatially and developmentally in ligules and correlate with Myb111 transcript abundances, and a downstream flavonoid biosynthetic gene. Heterologous expression in Arabidopsis in Atmyb12 mutants showed HaMyb111 to be able to regulate flavonol glycoside accumulation, albeit with different molecules than those that accumulate in Helianthus. Several lines of evidence are consistent with transcriptional regulation of myb111 accounting for the variation in bullseye size.

      Functional analysis examined three possible functional roles, in pollinator attraction, thermal regulation of flowers, and water loss in excised flowers (ligules?), providing support for the first and last, but not the second possible functions, confirming the results of previous studies on the pollinator attraction and water loss functions for flavonol glycosides. The thermal imaging work of dawn-exposed flower heads provided an elegant falsification of the temperature regulation hypothesis. Biogeographic clines in bullseye size correlated with temperature and humidity clines, providing a confirmation of the hypothesis posed by Koski and Ashman about the patterns being consistent with Gloger's rule, and historical trends from herbarium collections over climate change and ozone depletion scenarios. The work hence represents a major advance from Moyers et al. 2017's genetic analysis of bullseyes in sunflowers, and confirms the role established in Petunia for this Myb TF for flavonoid glycoside accumulations, in a new tissue, the ligule.

      Thank you. We have specified in the legend of Fig. 4i of the revised manuscript that desiccation was measured in individual detached ligules, and added further details about the experiment in the Methods section.

      While there is a correlation between pigmentation and temperature/humidity in our dataset, it goes in the opposite direction to what would be expected under Gloger’s rule – that is, we see stronger pigmentation in drier/colder environments, contrary to what is generally observed in animals. This is also contrary to what was observed in Koski and Ashman, Nature Plants 2015, where the authors found that floral UV pigmentation increased at lower latitudes and higher levels of UV radiation. While possibly rarer, such “anti-Gloger” patterns have been observed in plants before (Lev-Yadun, Plant Signal Behav 2016).

      Weakness: The authors were not able to confirm their inferences about myb111 function through direct manipulations of the locus in sunflower.

      That is unfortunately correct. Reliable and efficient transformation of cultivated sunflower (much less of wild sunflower species) has eluded the sunflower community (including our laboratories) so far – see for example discussion on the topic in Lewi et al. Agrobacterium protocols 2016, and Sujatha et al. PCTOC 2012. We had therefore to rely on heterologous complementation in Arabidopsis; while this approach has limitations, we believe that its results, given also the similarity in expression patterns between HaMYB111 and AtMYB111, and in combination with the other experiments reported in our manuscript, make a convincing case that HaMYB111 regulates flavonol glycosides accumulation in sunflower ligules.

      Given that the flavonol glycosides that accumulate in Helianthus are different from those regulated when the gene is heterologously expressed in Arabidopsis, the biochemical function of Hamyb111, while quite reasonable, is not completely watertight. The flavonol glycosides are not fully characterized (only MS/MS data are provided) and named only with cryptic abbreviations in the main figures.

      We believe that the fact that expression of HaMYB111 in the Arabidopsis myb111 mutant reproduces the very same pattern of flavonol glycosides accumulation found in wild type Col-0 is proof that its biochemical function is the same as that of the endogenous AtMYB111 gene – that is, HaMYB111 induces expression of the same genes involved in flavonol glycosides biosynthesis in Arabidopsis. Differences in function between HaMYB111 and AtMYB111 would have resulted in different flavonol profiles between wild type Col-0 and 35S::HaMYB111 myb111 lines. It should be noted that the known direct targets of AtMYB111 in Arabidopsis are genes involved in the production of the basic flavonol aglycone (Stracke et al., Plant J 2007). Differences in flavonol glycoside profiles between the two species are likely due to broader differences between the genetic networks regulating flavonol biosynthesis: additional layers of regulation of the genes targeted by MYB111, or differential regulation (or presence/absence variation) of genes controlling downstream flavonol glycosylation and conversion between different flavonols.

      In the revised manuscript, we have added the full names of all identified peaks to the legend of Figures 3a,b,e.

      This, and the differences in metabolite accumulation between Arabidopsis and Helianthus, become a bit problematic for the functional interpretations. Here the authors may want to re-read Gronquist et al. 2002: PNAS as a cautionary tale about inferring function from the spatial location of metabolites. In this study, the Eisner/Meinwald team discovered that embedded in the UV-absorbing floral nectar guides, amongst the expected array of flavonoid glycosides, were isoprenylated phloroglucinols, which have both UV-absorbing and herbivore defensive properties. Hence the authors may want to re-examine some of the other unidentified metabolites in the tissues of the bullseyes, including the caffeoyl quinic acids, for alternative functional hypotheses for the observed variation in bullseye size (e.g. herbivore defense of ligules).

      This is a good point. We have included a more explicit mention of the possible role of caffeoyl quinic acid (CQA) as a UV pigment in the main text, and have highlighted at the end of the manuscript other possible factors that could contribute to variation for floral UV patterns in wild sunflowers.

      We should note, however, that CQA plays a considerably smaller role than flavonols in explaining UV absorbance in UV-absorbing (parts of) sunflower ligules, and the difference in abundance with respect to UV-reflecting (parts of) ligules is much less obvious than for flavonols (the height of the absorbance peak for CQA is reduced only 2-3 fold in UV-reflecting tissues, vs. 7-70 fold reductions for individual quercetin glycosides). Therefore, flavonols are clearly the main pigment responsible for UV patterning in ligules. This is in contrast with the situation for Hypericum calycinum reported in Gronquist et al., PNAS 2002, where dearomatized isoprenylated phloroglucinols (DIPs) are much more abundant than flavonols in most floral tissues, including petals. The localization of DIP accumulation, in reproductive organs and on the abaxial (“lower”) side of the petals (so that they would be exposed when the flower is closed), is also more consistent with a role in prevention of herbivory; whereas a role in pollinator attraction would be expected to involve UV pigmentation on the adaxial (“upper”) side of the petals, none is found there in this species.

      The hypotheses regarding a role for the flavonoid glycosides regulated by MYB111 expression in transpirational mitigation, and hence in conferring a selective advantage under high temperatures and low and high humidities, are not strongly supported by the data provided. The water loss data from excised flowers (or ligules; this is unclear from the Methods description) are not equivalent to measures of transpiration rates (the stomatal-controlled release of water), which are better performed with intact flowers by porometry or other forms of gas-exchange measurement. Excised tissues tend to have uncontrolled stomatal function, and elevated cuticular water loss at damaged sites. The putative fitness benefits of variable bullseye size under different humidity regimes, proposed to explain the observed geographical clines in bullseye size, remain untested.

      We have clarified in the text and Methods section that the desiccation experiments were performed on detached ligules. We agree that the results of these experiments do not constitute direct proof that UV patterns/flavonol levels have an impact on plant fitness under different humidities in the wild; our aim was simply to provide a plausible physiological explanation for the correlation we observe between floral UV patterns and relative humidity. However, we do believe they are strongly suggestive of a role for floral flavonol/UV patterns in regulating transpiration, which is consistent with previous observations that flowers are a major source of transpiration in plants (Galen et al., Am Nat 2000, and other references in the manuscript). As also suggested by other reviewers, we have softened our interpretation of these results to clarify that they are suggestive, but not proof, of a connection between floral UV patterns, ligule transpiration and environmental humidity levels.

      “While desiccation rates are only a proxy for transpiration in field conditions (Duursma et al. 2019, Hygen et al. 1951), and other factors might affect ligule transpiration in this set of lines, this evidence (strong correlation between LUVp and summer relative humidity; known role of flavonol glycosides in regulating transpiration; and correlation between extent of ligule UV pigmentation and desiccation rates) suggests that variation in floral UV pigmentation in sunflowers is driven by the role of flavonol glycosides in reducing water loss from ligules, with larger floral UV patterns helping prevent drought stress in drier environments.” (new lines 462-469)

      Detached ligules were chosen to avoid confounding the results, in case differences between lines in the physiology of the rest of the inflorescence/plant also affect rates of water loss. Desiccation/water loss measurements were performed for consistency with the experiments reported in Nakabayashi et al., Plant J. 2014, in which the effects of flavonol accumulation (through overexpression of AtMYB12) on water loss/drought resistance were first reported. It should also be noted that the use of detached organs to study the effect of desiccation on transpiration, water loss and drought responses is common in the literature (see for example Hygen, Physiol Plant 1951; Aguilar et al., J Exp Bot 2000; Chen et al., PNAS 2011; Egea et al., Sci Rep 2018; Duursma et al., New Phytol 2019, among others). While removing the ligules creates a more stressful/artificial situation, mechanical factors are likely to affect all ligules and leaves in the same way, and we see no obvious reason why that would affect the small LUVp group more than the large LUVp group (individuals in the two groups were selected to represent several geographically unrelated populations).

      We have added some of the aforementioned references to the main text and Methods section of the revised manuscript to support our use of this experimental setup.

      Alternative functional hypotheses for the observed variation in bullseye size in herbivore resistance or floral volatile release could also be mentioned in the Discussion. Are the large ligules involved in floral scent release?

      We have added sentences to the Results and Discussion, and Conclusions sections of the revised manuscript exploring possible additional factors that could influence patterns of UV pigmentation across sunflower populations, including resistance to herbivory and floral volatiles. While some work has been done to characterize floral volatiles in sunflower (e.g. Etievant et al. J. Agric. Food Chem; Pham-Delegue et al. J. Chem. Ecol. 1989), to our knowledge the role of ligules in their production has not been investigated.

      In the revised manuscript, the section “A dual role for floral UV pigmentation” now includes the sentences:

      “Although pollinator preferences in this experiment could still be affected by other unmeasured factors (nectar content, floral volatiles), these results are consistent with previous results showing that floral UV patterns play a major role in pollinator attraction (Horth et al., 2014, Koski and Ashman, 2014, Rae and Vamosi, 2013, Sheehan et al., 2016).” (new lines 378-381)

      And the Conclusions sections includes the sentence:

      “It should be noted that, while we have examined some of the most likely factors explaining the distribution of variation for floral UV patterns in wild H. annuus across North America, other abiotic factors could play a role, as well as biotic ones (e.g. the aforementioned differences in pollinator assemblages, or a role of UV pigments in protection from herbivory (Gronquist et al., 2001)).” (new lines 540-544)

      Reviewer #3 (Public Review):

      Todesco et al undertake an ambitious study to understand UV-absorbing variation in sunflower inflorescences, which often, but not always display a "bullseye" pattern of UV-absorbance generated by ligules of the ray flowers. [...] I think this manuscript has high potential impact on science on both of these fronts.

      Thank you! We are aware that our experiments do not provide a direct link between UV patterns and fitness in natural populations (although we think they are strongly suggestive) and that, as pointed out also by other reviewers, there are other possible (unmeasured) factors that could explain or contribute to explain the patterns we observed. In the revised manuscript we have better characterized the aims and interpretation of our desiccation experiment, and modified the main text to acknowledge other possible factors affecting pollination preferences (nectar production, floral volatiles) and variation for floral UV patterns in H. annuus (pollinator assemblages, resistance to herbivory).

    1. Author Response

      Reviewer #1 (Public Review):

      The work by Yijun Zhang and Zhimin He et al. analyzes the role of HDAC3 within DC subsets. Using an inducible ERT2-cre mouse model they observe the dependency of pDCs but not cDCs on HDAC3. The requirement of this histone modifier appears to be early during development, around the CLP stage. Tamoxifen-treated mice lack almost all pDCs besides lymphoid progenitors. Through bulk RNA-seq experiments the authors identify multiple DC-specific target genes within the remaining pDCs, and further, using Cut and Tag technology, they validate some of the identified targets of HDAC3. Collectively the study is well executed and shows the requirement of HDAC3 in pDCs but not cDCs, in line with the recent findings of a lymphoid origin of pDC.

      1) While the authors provide extensive data on the requirement of HDAC3 within progenitors, the high expression of HDAC3 in mature pDCs may underlie a functional requirement. Have you tested IFN production in CD11c-Cre pDCs? Are there transcriptional differences between pDCs from HDAC3 CD11c-Cre and WT mice?

      We greatly appreciate the reviewer’s point. We have confirmed that Hdac3 can be efficiently deleted in pDCs of Hdac3fl/fl-CD11c Cre mice (Figure 5-figure supplement 1 in revised manuscript). Furthermore, in those Hdac3fl/fl-CD11c Cre mice, we have observed significantly decreased expression of key cytokines (Ifna, Ifnb, and Ifnl) by pDCs upon activation by CpG ODN (shown in Author response image 1). Therefore, HDAC3 is also required for proper pDC function. However, we have yet to conduct RNA-seq analysis comparing pDCs from HDAC CD11c cre and WT mice.

      Author response image 1.

      Cytokine expression in Hdac3 deficient pDCs upon activation

      2) A more detailed characterization of the progenitor compartment that is compromised following depletion would be important, as also suggested in the specific points.

      We thank the reviewer for this constructive suggestion. We have performed a thorough analysis of the phenotype of hematopoietic stem cells and progenitor cells at various developmental stages in the bone marrow of Hdac3 deficient mice, based on the gating strategy from the recommended reference. Briefly, we analyzed the progenitor subpopulations MPP2, MPP3 and MPP4 as described by Pietras et al. (2015), using the same gating strategy for hematopoietic stem/progenitor cells. As shown in Author response image 2 and Author response image 3, the number of LSK cells was increased in Hdac3 deficient mice, especially in the MPP2 and MPP3 subpopulations, whereas MPP4 numbers did not change significantly. In contrast, the numbers of LT-HSC, ST-HSC and CLP were all dramatically decreased. This result has been optimized and added as Figure 3A in the revised manuscript. The relevant description has been added and underlined in the revised manuscript, Page 6, Lines 164-168.
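      For readers less familiar with this hierarchy, the subset definitions can be written as simple boolean gates. The sketch below is illustrative only: it assumes the LSK/Flk2/CD48/CD150 marker combinations commonly attributed to Pietras et al. (2015) (HSC: Flk2−CD48−CD150+; MPP2: Flk2−CD48+CD150+; MPP3: Flk2−CD48+CD150−; MPP4: Flk2+CD150−), which should be checked against that paper, and it does not model the LT-HSC/ST-HSC split or the CLP gate used in the manuscript. The `Cell` type and all function names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """Positive/negative calls for one compensated event (hypothetical structure)."""
    lin: bool    # positive for any lineage marker
    sca1: bool
    ckit: bool
    flk2: bool
    cd48: bool
    cd150: bool


def is_lsk(c: Cell) -> bool:
    """Lin- Sca-1+ c-Kit+ (LSK) gate."""
    return not c.lin and c.sca1 and c.ckit


def classify_lsk(c: Cell) -> str:
    """Assign an event to an assumed stem/progenitor subset."""
    if not is_lsk(c):
        return "non-LSK"
    if c.flk2:
        # MPP4 is Flk2+ CD150-; Flk2+ CD150+ events are left unassigned here
        return "MPP4" if not c.cd150 else "unassigned"
    if c.cd48:
        # Flk2- CD48+ events are split on CD150
        return "MPP2" if c.cd150 else "MPP3"
    # Flk2- CD48- events: only CD150+ is called HSC in this sketch
    return "HSC" if c.cd150 else "unassigned"


def subset_counts(cells):
    """Tally subsets, e.g. to compare control and knockout samples."""
    return Counter(classify_lsk(c) for c in cells)
```

      Comparing `subset_counts` between control and Hdac3-deficient samples mirrors, in miniature, the per-population comparison shown in Author response image 3.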

      Author response image 2.

      Gating strategy for hematopoietic stem/progenitor cells in bone marrow.

      Author response image 3.

      Hematopoietic stem/progenitor cells in Hdac3 deficient mice

      Reviewer #2 (Public Review):

      In this article Zhang et al. report that Histone Deacetylase-3 (HDAC3) is highly expressed in mouse pDC and that pDC development is severely affected both in vivo and in vitro when using mice harbouring conditional deletion of HDAC3. However, pDC numbers are not affected in Hdac3fl/fl Itgax-Cre mice, indicating that HDAC3 is dispensable in CD11c+ late stages of pDC differentiation. Indeed, the authors provide extensive experimental evidence for a role of HDAC3 in early precursors of pDC development, by combining adoptive transfer, gene expression profiling and in vitro differentiation experiments. Mechanistically, the authors have demonstrated that HDAC3 activity represses the expression of several transcription factors promoting cDC1 development, thus allowing the expression of genes involved in pDC development. In conclusion, these findings reveal HDAC3 as a key epigenetic regulator of the expression of the transcription factors required for pDC vs cDC1 developmental fate.

      These results are novel and very promising. However, additional information and possibly further investigations are required to improve the clarity and the robustness of this article.

      Major points

      1) The gating strategy adopted to identify pDC in the BM and in the spleen should be entirely described and shown, at least as a Supplementary Figure. For the BM the authors indicate in the M & M section that they negatively selected cells for CD8a and B220, but both markers are actually expressed by differentiated pDC. However, in the Figures 1 and 2 pDC has been shown to be gated on CD19- CD11b- CD11c+. What is the precise protocol followed for pDC gating in the different organs and experiments?

      We apologize for not clearly describing the protocols used in this study. Please see the detailed gating strategy for pDC in bone marrow, and for pDC and cDC in spleen (Figure 4 and Figure 5). This information has now been added to Figure1−figure supplement 3. The relevant description has been underlined on Page 5, Lines 113-116, of the revised manuscript.

      We would like to clarify that in our study we used two different antibody cocktail panels: one for bone marrow Lin- cells, including mAbs to CD2/CD3/TER-119/Ly6G/B220/CD11b/CD8/CD19; the other for DC enrichment, including mAbs to CD3/CD90/TER-119/Ly6G/CD19. We included B220 in the Lineage cocktail to deplete B cells and pDCs, in order to enrich for progenitor cells from bone marrow. However, when enriching for pDCs and cDCs, neither B220 nor CD8a was included in the cocktail, to avoid depletion of the pDC and cDC1 subsets. For the flow cytometry analysis of pDCs, we gated pDCs as the CD19−CD11b−CD11c+B220+SiglecH+ population in both bone marrow and spleen. The relevant description has been underlined in the revised manuscript, Page 16, Lines 431-434.
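      The logic of the two depletion panels and of the pDC gate can be made explicit with set operations. This is a minimal illustrative sketch: the marker sets are transcribed from the response above (with "CD8" standing for CD8a), while the helper names and the example pDC phenotype used in the usage note are hypothetical. It shows why a B220+ (or CD8+) pDC would be removed by the Lineage cocktail but retained by the DC-enrichment cocktail.

```python
# Antibody targets of the two depletion cocktails described in the response.
LIN_COCKTAIL = frozenset({"CD2", "CD3", "TER-119", "Ly6G", "B220", "CD11b", "CD8", "CD19"})
DC_ENRICH_COCKTAIL = frozenset({"CD3", "CD90", "TER-119", "Ly6G", "CD19"})

# pDC gate CD19- CD11b- CD11c+ B220+ SiglecH+ (marker -> must be positive?).
PDC_GATE = {"CD19": False, "CD11b": False, "CD11c": True, "B220": True, "SiglecH": True}


def depleting_targets(cocktail, positive_markers):
    """Cocktail targets bound by a cell with these surface markers (empty set = retained)."""
    return set(cocktail & set(positive_markers))


def in_pdc_gate(positive_markers):
    """True if the cell meets every positive and negative criterion of the pDC gate."""
    markers = set(positive_markers)
    return all((m in markers) == required for m, required in PDC_GATE.items())
```

      For a hypothetical CD8+ pDC with markers {"CD11c", "B220", "SiglecH", "CD8"}, `depleting_targets(LIN_COCKTAIL, ...)` returns {"B220", "CD8"}, whereas the DC-enrichment cocktail returns an empty set and the cell passes `in_pdc_gate`.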

      2) pDC identified in the BM as SiglecH+ B220+ can actually contain DC precursors, which can express these markers, too. This could explain why the impact of HDAC3 deletion appears stronger in the spleen than in the BM (Figures 1A and 2A). Along the same line, I think that it would be important to show the phenotype of pDC in control vs HDAC3-deleted mice for the different pDC markers used (SiglecH, B220, Bst2), and I would suggest also including Ly6D, taking into account the results obtained in Figures 4 and 7. Finally, as HDAC3 deletion induces downregulation of CD8a in cDC1 and pDC express CD8a, it would be important to analyse the expression of this marker on control vs HDAC3-deleted pDC.

      We agree with the reviewer’s points. In the revised manuscript, we incorporated major surface markers, including Siglec H, B220, Ly6D, and PDCA-1, all of which consistently demonstrated a substantial decrease in the pDC population in Hdac3 deficient mice. Moreover, we noticed that Ly6D+ pDCs showed a greater decrease in Hdac3 deficient mice. Additionally, the percentage and number of both CD8+ and CD8- pDCs were decreased in Hdac3 deficient mice (Author response image 4). These results are shown in Figure1−figure supplement 4 of the revised manuscript. The relevant description has been added and underlined in the revised manuscript, Page 5, Lines 121-125.

      Author response image 4.

      Bone marrow pDCs in Hdac3 deficient mice revealed by multiple surface markers

      3) How do the authors explain that in the absence of HDAC3 cDC2 development increased in vivo in chimeric mice, but reduced in vitro (Figures 2B and 2E)?

      As explained in our response to minor point 5 of Reviewer #1, we suggest that the variability may be explained by the timing of analysis after HDAC3 deletion. In Figure 2C, we analyzed cells from the recipients one week after the final tamoxifen treatment and observed no significant change in the percentage of cDC2 when all the experimental data were pooled. In Figure 2E, where tamoxifen was administered at Day 0 of Flt3L-mediated DC differentiation in vitro, the DC subsets generated were analyzed at different time points. We observed no significant changes in cDCs and cDC2 at Day 5, but decreases in the percentage of cDC2 at Day 7 and Day 9. This suggests that the cDC subsets at Day 5 might have originated from progenitors at a later stage, while those at Day 7 and Day 9 might originate from earlier progenitors. Therefore, based on these in vitro and in vivo experiments, we believe that the variation in the cDC2 phenotype might be attributed to the different progenitor stages that generated these cDCs.

      4) More generally, as reported also by authors (line 207), the reconstitution with HDAC3-deleted cells is poorly efficient. Although cDC seem not to be impacted, are other lymphoid or myeloid cells affected? This should be expected as HDAC3 regulates T and B development, as well as macrophage function. This should be important to know, although this does not call into question the results shown, as obtained in a competitive context.

      In this study, we found no significant influence on T cells, mature B cells or NK cells, but immature B cells were significantly decreased, in Hdac3-ERT2-Cre mice after tamoxifen treatment (Figure 6). However, in the bone marrow chimera experiments, the numbers of major lymphoid cells were decreased due to the impaired reconstitution capacity of Hdac3 deficient progenitors. Consistent with our finding, it has been reported that HDAC3 was required for T cell and B cell generation, in HDAC3-VavCre mice (Summers et al., 2013), and was necessary for T cell maturation (Hsu et al., 2015). Moreover, HDAC3 is also required for the expression of inflammatory genes in macrophages upon activation (Chen et al., 2012; Nguyen et al., 2020).

      5) What are the precise gating strategies used to identify the different hematopoietic precursors in the Figure 4 ? In particular, is there any lineage exclusion performed?

      We apologize for not describing the experimental procedures clearly. In this study we enriched the lineage negative (Lin−) cells from the bone marrow using a Lineage-depleting antibody cocktail including mAbs to CD2/CD3/TER-119/Ly6G/B220/CD11b/CD8/CD19. We also provide the gating strategy implemented for sorting LSK and CDP populations from the Lin− cells in the bone marrow (Author response image 5), shown in the Figure 3A and Figure4−figure supplement 1 of revised manuscript.

      Author response image 5.

      Gating strategy for LSK, CD115+ CDP and CD115− CDP in bone marrow

      6) Moreover, what is the SiglecH+ CD11c- population appearing in the spleen of mice reconstituted with HDAC3-deleted CDP, in Fig 4D?

      We also noticed the appearance of a SiglecH+CD11c− cell population in the spleen of recipient mice reconstituted with HDAC3-deficient CD115−CDPs, while the presence of this population was not as significant in the HDAC3-Ctrl group, as shown in Figure 4D. We speculate that this SiglecH+CD11c− cell population might represent some cells at a differentiation stage earlier than pre-DCs. Alternatively, the relatively increased percentage of this population derived from HDAC3-deficient CD115−CDP might be due to the substantially decreased total numbers of DCs. This could be clarified by further analysis using additional cell surface markers.

      7) Finally, in Fig 4H, how do the authors explain that Hdac3fl/fl express Il7r, while they are supposed to be sorted CD127- cells?

      This is indeed an interesting question. In this study, we confirmed that CD115−CDPs were isolated from the surface CD127− cell population for RNA-seq analysis, and the purity of the sorted cells was checked (Author response image 6), as shown in Figure4−figure supplement 1 of the revised manuscript.

      A possible explanation for the expression of Il7r mRNA in some HDAC3fl/fl CD115−CDPs, as revealed by RNA-seq analysis in Figure 4H, is that these cells express CD127 on the cell surface at a very low level and therefore could not be efficiently excluded by sorting for surface CD127− cells.

      Author response image 6.

      CD115−CDPs sorting from Hdac3-Ctrl and Hdac3-KO mice

      8) What is known about the expression of HDAC3 in the different hematopoietic precursors analysed in this study? This information is available only for a few of them in Supplementary Figure 1. If not yet studied, they should be addressed.

      We conducted additional analysis of Hdac3 expression in hematopoietic progenitor cells at different stages, based on RNA-seq analysis. The data revealed a relatively consistent level of Hdac3 expression across progenitor populations, including HSC, MPP4, CLP, CDP and BM pDCs (Author response image 7). This suggests that HDAC3 may play an important role in the regulation of hematopoiesis at multiple stages. This information has now been added to Figure1−figure supplement 1B of the revised manuscript.

      Author response image 7.

      Hdac3 expression in hematopoietic progenitor cells

      9) It would be highly informative to extend CUT and Tag studies to Irf8 and Tcf4, if this is technically feasible.

      We totally agree with the reviewer. We did attempt to use CUT and Tag to compare the binding sites of IRF8 and TCF4 in wild-type and Hdac3-deficient pDCs. However, it proved technically unfeasible to obtain reliable results, due to the limited number of cells we could obtain from the HDAC3 deficient mice. We are committed to exploring alternative approaches or technologies to address this issue in future studies.

    1. Author Response:

      Reviewer #1:

      1) The user manual and tutorial are well documented, although the actual code could do with more explicit documentation and comments throughout. The overall organisation of the code is also a bit messy.

      We have now implemented an ongoing, automated code review via Codacy (https://app.codacy.com/gh/caseypaquola/BigBrainWarp/dashboard). The grade is published as a badge on GitHub. We improved the quality of the code to an A grade by increasing comments and fixing code style issues. Additionally, we standardised the nomenclature throughout the toolbox to improve consistency across scripts and we restructured the bigbrainwarp function.

      2) My understanding is that this toolbox can take maps from BigBrain to MRI space and vice versa, but the maps that go in the direction BigBrain->MRI seem to be confined to those provided in the toolbox (essentially the density profiles). What if someone wants to do some different analysis on the BigBrain data (e.g. looking at cellular morphology) and wants that mapped onto MRI spaces? Does this tool allow for analyses that involve the raw BigBrain data? If so, then at what resolution and with what scripts? I think this tool will have much more impact if that was possible. Currently, it looks as though the 3 tutorial examples are basically the only thing that can be done (although I may be lacking imagination here).

      The bigbrainwarp function allows input of raw BigBrain data in volume and surface forms. For volumetric inputs, the image must be aligned to the full BigBrain or BigBrainSym volume, but the function is agnostic to the input voxel resolution. We have also added an option for the user to specify the output voxel resolution. For example,

      bigbrainwarp --in_space bigbrain --in_vol cellular_morphology_in_bigbrain.nii \
        --interp linear --out_space icbm --out_res 0.5 \
        --desc cellular_morphology --wd working_directory

      where “cellular_morphology_in_bigbrain.nii” was generated from a BigBrain volume (see Table 2 below for all parameters). The BigBrain volume may be one of the 100-1000 µm resolution images provided on the ftp or a resampled version of these images, as long as the full field of view is maintained. For surface-based inputs, the data must contain a value for each vertex of the BigBrain/BigBrainSym mesh. We have clarified these points in the Methods, illustrated the potential transformations in an extended Figure 3 and highlighted the distinctiveness of the tutorial transformations in the Results.
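      For users who script many transformations, the same call can be assembled programmatically. The sketch below is a hypothetical Python wrapper, not part of BigBrainWarp: it assumes `bigbrainwarp` is installed and on `PATH`, uses only the flags that appear in the example command above, and its default values simply mirror that example rather than any toolbox defaults.

```python
import shlex
import subprocess
from typing import List, Optional


def build_bigbrainwarp_cmd(in_vol: str, desc: str, wd: str,
                           in_space: str = "bigbrain", out_space: str = "icbm",
                           interp: str = "linear",
                           out_res: Optional[float] = None) -> List[str]:
    """Assemble a volumetric bigbrainwarp call as an argument list (avoids shell quoting issues)."""
    cmd = ["bigbrainwarp",
           "--in_space", in_space,
           "--in_vol", in_vol,
           "--interp", interp,
           "--out_space", out_space]
    if out_res is not None:
        # Optional output voxel resolution; the flag is omitted when not requested.
        cmd += ["--out_res", str(out_res)]
    cmd += ["--desc", desc, "--wd", wd]
    return cmd


def run_bigbrainwarp(cmd: List[str]) -> None:
    """Run the command, raising subprocess.CalledProcessError on a non-zero exit status."""
    print("Running:", shlex.join(cmd))
    subprocess.run(cmd, check=True)
```

      Calling `build_bigbrainwarp_cmd("cellular_morphology_in_bigbrain.nii", "cellular_morphology", "working_directory", out_res=0.5)` reproduces the example command above; passing an argument list to `subprocess.run` rather than a shell string sidesteps the line-continuation and quoting pitfalls of copying commands between terminals.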

      3) An obvious caveat to BigBrain is that it is a single brain, and we know there are sometimes substantial individual variations in e.g. areal definition. This is only slightly touched upon in the discussion. It might be worth commenting on this more. As I see it, there are multiple considerations. For example (i) Surface-to-Surface registration in the presence of morphological idiosyncrasies: what parts of the brain can we "trust" and what parts are uncertain? (ii) MRI parcellations mapped onto BigBrain will vary in how accurately they may reflect the BigBrain areal boundaries: if histo boundaries do not correspond with MRI-derived ones, is that because BigBrain is slightly different or is it a genuine divergence between modalities? Of course addressing these questions is out of scope of this manuscript, but some discussion could be useful; I also think this toolbox may be useful for addressing these very concerns!

      We agree that these are important questions and hope that BigBrainWarp will propel further research. Here, we consider these questions from two perspectives; the accuracy of the transformations and the potential influence of individual variation. For the former, we conducted a quantitative analysis on the accuracy of transformations used in BigBrainWarp (new Figure 2). We provide a function (evaluate_warp.sh) for BigBrainWarp users to assess accuracy of novel deformation fields and encourage detailed inspection of accuracy estimates and deformation effects for region of interest studies. For the latter, we expanded our Discussion of previous research on inter-individual variability and comment on the potential implications of unquantified inter-individual variability for the interpretation of BigBrain-MRI comparisons.

      Methods (P.7-8):

      “A prior study (Xiao et al., 2019) was able to further improve the accuracy of the transformation for subcortical structures and the hippocampus using a two-stage multi-contrast registration. The first stage involved nonlinear registration of BigBrainSym to a PD25 T1-T2 fusion atlas (Xiao et al., 2017, 2015), using manual segmentations of the basal ganglia, red nucleus, thalamus, amygdala, and hippocampus as additional shape priors. Notably, the PD25 T1-T2 fusion contrast is more similar to the BigBrainSym intensity contrast than a T1-weighted image. The second stage involved nonlinear registration of PD25 to ICBM2009sym and ICBM2009asym using multiple contrasts. The deformation fields were made available on Open Science Framework (https://osf.io/xkqb3/). The accuracy of the transformations was evaluated relative to overlap of region labels and alignment of anatomical fiducials (Lau et al., 2019). The two-stage procedure resulted in 0.86-0.97 Dice coefficients for region labels, improving upon direct overlap of BigBrainSym with ICBM2009sym (0.55-0.91 Dice) (Figure 2Aii, 2Aiv top). Transformed anatomical fiducials exhibited 1.77±1.25mm errors, on par with direct overlap of BigBrainSym with ICBM2009sym (1.83±1.47mm) (Figure 2Aiii, 2Aiv below). The maximum misregistration distance (BigBrainSym=6.36mm, Xiao=5.29mm) provides an approximation of the degree of uncertainty in the transformation. In line with this work, BigBrainWarp enables evaluation of novel deformation fields using anatomical fiducials and region labels (evaluate_warps.sh). The script accepts a nonlinear transformation file for registration of BigBrainSym to ICBM2009sym, or vice versa, and returns the Jacobian map, Dice coefficients for labelled regions and landmark misregistration distances for the anatomical fiducials.

      The unique morphology of BigBrain also presents challenges for surface-based transformations. Idiosyncratic gyrification of certain regions of BigBrain, especially the anterior cingulate, causes misregistration (Lewis et al., 2020). Additionally, the areal midline representation of BigBrain, following inflation to a sphere, is disproportionately smaller than that of standard surface templates, which is related to differences in surface area, in hemisphere separation methods, and in tessellation methods. To overcome these issues, ongoing work (Lewis et al., 2020) combines a specialised BigBrain surface mesh with multimodal surface matching [MSM; (Robinson et al., 2018, 2014)] to co-register BigBrain to standard surface templates. In the first step, the BigBrain surface meshes were re-tessellated as unstructured meshes with variable vertex density (Möbius and Kobbelt, 2010) to be more compatible with FreeSurfer-generated meshes. Then, coarse-to-fine MSM registration was applied in three stages. First, an affine rotation was applied to the BigBrain sphere, with an additional “nudge” based on an anterior cingulate landmark. Next, nonlinear/discrete alignment was performed using sulcal depth maps (emphasising global scale, Figure 2Biii), followed by nonlinear/discrete alignment using curvature maps (emphasising finer detail, Figure 2Biii). The higher-order MSM procedure that was implemented for BigBrain maximises concordance of these features while minimising surface deformations in a physically plausible manner, accounting for size and shape distortions (Figure 2Bi) (Knutsen et al., 2010; Robinson et al., 2018). This modified MSMsulc+curv pipeline improves the accuracy of transformed cortical maps (4.38±3.25mm), compared to a standard MSMsulc approach (8.02±7.53mm) (Figure 2Bii-iii) (Lewis et al., 2020).”
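      The two volumetric accuracy metrics in this Methods excerpt (Dice overlap of region labels and landmark misregistration distance) are simple to compute once both sets of labels or landmarks are in the same space. The following is a minimal NumPy sketch of those definitions, not the code of evaluate_warps.sh: it assumes the reference and transformed label volumes have already been resampled onto the same voxel grid, that landmark coordinates are paired row by row in millimetres, and it reports the spread as a population standard deviation (ddof=0), which may differ from the convention behind the ± values quoted above. Function names are hypothetical.

```python
import numpy as np


def dice_per_label(ref: np.ndarray, moved: np.ndarray) -> dict:
    """Dice = 2|A ∩ B| / (|A| + |B|) for every nonzero label present in either volume."""
    if ref.shape != moved.shape:
        raise ValueError("label volumes must share one voxel grid")
    labels = np.union1d(np.unique(ref), np.unique(moved))
    scores = {}
    for lab in labels[labels != 0]:          # label 0 is treated as background
        a, b = ref == lab, moved == lab
        scores[int(lab)] = 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())
    return scores


def landmark_misregistration(ref_xyz, moved_xyz):
    """Euclidean error per paired landmark, plus mean, SD (ddof=0) and maximum."""
    d = np.linalg.norm(np.asarray(ref_xyz, float) - np.asarray(moved_xyz, float), axis=1)
    return d, float(d.mean()), float(d.std()), float(d.max())
```

      The maximum returned by `landmark_misregistration` corresponds to the "maximum misregistration distance" that the excerpt uses as a rough gauge of transformation uncertainty.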

      Figure 2: Evaluating BigBrain-MRI transformations. A) Volume-based transformations. i. Jacobian determinant of the deformation field shown on a sagittal slice and stratified by lobe. Subcortical+ includes the shape priors (as described in Methods) and the + connotes hippocampus, which is allocortical. Lobe labels were defined based on assignment of CerebrA atlas labels (Manera et al., 2020) to each lobe. ii. Sagittal slices illustrate the overlap of native ICBM2009b and transformed subcortical+ labels. iii. Superior view of anatomical fiducials (Lau et al., 2019). iv. Violin plots show the Dice coefficient of regional overlap (ii) and landmark misregistration (iii) for the BigBrainSym and Xiao et al. approaches. Higher Dice coefficients show improved registration of subcortical+ regions with Xiao et al., while distributions of landmark misregistration indicate similar performance for alignment of anatomical fiducials. B) Surface-based transformations. i. Inflated BigBrain surface projections and ridgeplots illustrate regional variation in the distortions of the mesh induced by the modified MSMsulc+curv pipeline. ii. Eighteen anatomical landmarks shown on the inflated BigBrain surface (above) and inflated fsaverage (below). BigBrain landmarks were transformed to fsaverage using the modified MSMsulc+curv pipeline. Accuracy of the transformation was calculated on fsaverage as the geodesic distance between landmarks transformed from BigBrain and the native fsaverage landmarks. iii. Sulcal depth and curvature maps are shown on the inflated BigBrain surface. Violin plots show the improved accuracy of the transformation using the modified MSMsulc+curv pipeline, compared to a standard MSMsulc approach.
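      Panel Ai summarises a deformation field by its Jacobian determinant: values above 1 indicate local expansion, values below 1 local compression, and values at or below 0 folding. A minimal NumPy sketch of that quantity follows, assuming the field has already been loaded as a dense displacement array u of shape (X, Y, Z, 3) in millimetres with known voxel spacing (reading the actual transform files is not shown). The function name is hypothetical, and this is not BigBrainWarp's implementation.

```python
import numpy as np


def jacobian_determinant(disp: np.ndarray, spacing=(1.0, 1.0, 1.0)) -> np.ndarray:
    """Voxelwise det(I + du/dx) for the mapping phi(x) = x + u(x).

    disp has shape (X, Y, Z, 3); each spatial axis needs at least 2 samples.
    """
    jac = np.empty(disp.shape[:3] + (3, 3))
    for i in range(3):
        # np.gradient gives d u_i / d x_j for j = 0, 1, 2 (central differences in the interior)
        grads = np.gradient(disp[..., i], *spacing)
        for j in range(3):
            jac[..., i, j] = grads[j]
    jac += np.eye(3)                  # identity term from phi(x) = x + u(x)
    return np.linalg.det(jac)         # batched determinant over all voxels
```

      Stratifying the resulting map by lobe, as in the panel, would then amount to indexing it with a lobe label volume defined on the same grid.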

      Discussion (P.18):

      “Cortical folding is variably associated with cytoarchitecture, however. The correspondence of morphology with cytoarchitectonic boundaries is stronger in primary sensory than association cortex (Fischl et al., 2008; Rajkowska and Goldman-Rakic, 1995a, 1995b). Incorporating more anatomical information in the alignment algorithm, such as intracortical myelin or connectivity, may benefit registration, as has been shown in neuroimaging (Orasanu et al., 2016; Robinson et al., 2018; Tardif et al., 2015). Overall, evaluating the accuracy of volume- and surface-based transformations is important for selecting the optimal procedure given a specific research question and to gauge the degree of uncertainty in a registration.”

      Discussion (P.19):

      “Despite all its promises, the singular nature of BigBrain currently prohibits replication and does not capture important inter-individual variation. While large-scale cytoarchitectural patterns are conserved across individuals, the position of areal boundaries relative to sulci varies, especially in association cortex (Amunts et al., 2020; Fischl et al., 2008; Zilles and Amunts, 2013). This can affect interpretation of BigBrain-MRI comparisons. For instance, in tutorial 3, low predictive accuracy of functional communities by cytoarchitecture may be attributable to subject-specific topographies, which are well established in functional imaging (Benkarim et al., 2020; Braga and Buckner, 2017; Gordon et al., 2017; Kong et al., 2019). Future studies should consider the influence of inter-subject variability in concert with the precision of transformations, as these two elements of uncertainty can impact our interpretations, especially at higher granularity.”

      Reviewer #2:

      This is a nice paper presenting a review of recent developments and research resulting from BigBrain and a tutorial guiding use of the BigBrainWarp toolbox. This toolbox supports registration to, and from, standard MRI volumetric and surface templates, together with mapping derived features between spaces. Examples include projecting histological gradients estimated from BigBrain onto fsaverage (and the ICBM2009 atlas) and projecting Yeo functional parcels onto the BigBrain atlas.

      The key strength of this paper is that it supports and expands on a comprehensive tutorial and Docker support available from the website. The tutorials there go into even more detail (with accompanying bash scripts) on how to run the full pipelines detailed in the paper. The Docker image makes the tool very easy to install, but I was also able to install from source. The tutorials provide diverse examples of a broad range of possible applications; as such, the combined resource has the potential to be highly impactful.

      The minor weaknesses of the paper relate to its clarity and depth. Firstly, I found the motivations of the paper initially unclear from the abstract. I would recommend much more clearly stating that this is a review paper of recent research developments resulting from the BigBrain atlas, and a tutorial to accompany the bash scripts which apply the warps between spaces. The registration methodology is explained elsewhere.

      In the revised Abstract (P.1), we emphasise that the manuscript involves a review of recent literature, the introduction of BigBrainWarp, and easy-to-follow tutorials to demonstrate its utility.

      “Neuroimaging stands to benefit from emerging ultrahigh-resolution 3D histological atlases of the human brain; the first of which is “BigBrain”. Here, we review recent methodological advances for the integration of BigBrain with multi-modal neuroimaging and introduce a toolbox, “BigBrainWarp”, that combines these developments. The aim of BigBrainWarp is to simplify workflows and support the adoption of best practices. This is accomplished with a simple wrapper function that allows users to easily map data between BigBrain and standard MRI spaces. The function automatically pulls specialised transformation procedures, based on ongoing research from a wide collaborative network of researchers. Additionally, the toolbox improves accessibility of histological information through dissemination of ready-to-use cytoarchitectural features. Finally, we demonstrate the utility of BigBrainWarp with three tutorials and discuss the potential of the toolbox to support multi-scale investigations of brain organisation.”

      I also found parts of the paper difficult to follow; as a methodologist without comprehensive neuroanatomical terminology, I would recommend that the review of past work be written in a more 'lay' way. In many cases, the figure captions also seemed insufficient at first. For example, it was not immediately obvious to me what is meant by 'mesiotemporal confluence', and Fig 1G is not referenced specifically in the text. In Fig 3C it is not immediately clear from the text of the caption that the cortical image represents the correlation from the plots, specifically since functional connectivity is itself estimated through correlation.

      In the updated manuscript, we have tried to remove neuroanatomical jargon and clearly define uncommon terms at the first instance in text. For example,

      “Evidence has been provided that cortical organisation goes beyond a segregation into areas. For example, large-scale gradients that span areas and cytoarchitectonic heterogeneity within a cortical area have been reported (Amunts and Zilles, 2015; Goulas et al., 2018; Wang, 2020). Such progress became feasible through integration of classical techniques with computational methods, supporting more observer-independent evaluation of architectonic principles (Amunts et al., 2020; Paquola et al., 2019; Schiffer et al., 2020; Spitzer et al., 2018). This paves the way for novel investigations of the cellular landscape of the brain.”

      “Using the proximal-distal axis of the hippocampus, we were able to bridge the isocortical and hippocampal surface models recapitulating the smooth confluence of cortical types in the mesiotemporal lobe, i.e. the mesiotemporal confluence (Figure 1G).”

      “Here, we illustrate how we can track resting-state functional connectivity changes along the latero-medial axis of the mesiotemporal lobe, from parahippocampal isocortex towards hippocampal allocortex, hereafter referred to as the iso-to-allocortical axis.”

      Additionally, we have expanded the captions for clarity. For example, Figure 3:

      “C) Intrinsic functional connectivity was calculated between each voxel of the iso-to-allocortical axis and 1000 isocortical parcels. For each parcel, we calculated the product-moment correlation (r) of rsFC strength with iso-to-allocortical axis position. Thus, positive values (red) indicate that rsFC of that isocortical parcel with the mesiotemporal lobe increases along the iso-to-allocortex axis, whereas negative values (blue) indicate decrease in rsFC along the iso-to-allocortex axis.”
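      The procedure in this caption, one product-moment correlation per isocortical parcel between rsFC strength and axis position, can be sketched in NumPy as below. The array layout (axis voxels in rows, parcels in columns) and the function name are assumptions for illustration; with the study's data the matrix would have 1000 columns.

```python
import numpy as np


def axis_correlation(rsfc, axis_pos):
    """Pearson r between axis position and rsFC strength, computed per parcel.

    rsfc: (n_axis_voxels, n_parcels) connectivity of each axis voxel with each parcel.
    axis_pos: (n_axis_voxels,) position of each voxel along the iso-to-allocortical axis.
    """
    x = axis_pos - axis_pos.mean()           # centre the axis positions
    y = rsfc - rsfc.mean(axis=0)             # centre each parcel's rsFC column
    cov = x @ y                              # summed cross-products, one per parcel
    return cov / (np.sqrt((x ** 2).sum()) * np.sqrt((y ** 2).sum(axis=0)))
```

      Positive values correspond to parcels whose connectivity rises towards the allocortical end of the axis, matching the red/blue convention in the caption.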

      My minor concern is over the lack of details in relation to the registration pipelines. I understand these are either covered in previous papers or are probably destined for bespoke publications (in the case of the surface registration approach) but these details are important for readers to understand the constraints and limitations of the software. At this time, the details for the surface registration only relate to an OHBM poster and not a publication, which I was unable to find online until I went through the tutorial on the BigBrain website. In general I think a paper should have enough information on key techniques to stand alone without having to reference other publications, so, in my opinion, a high level review of these pipelines should be added here.

      There aren't enough details on the registration. For the surface, what features were used to drive alignment, how was it parameterised (in particular the regularisation: strain, pairwise or areal), and how was it pre-processed prior to running MSM? All these details seem to be in the excellent poster. I appreciate that work deserves a stand-alone publication, but some details are required here for users to understand the challenges, constraints and limitations of the alignment. Similar high-level details should be given for the registration work.

      We expanded descriptions of the registration strategies behind BigBrainWarp, especially so for the surface-based registration. Additionally, we created a new Figure to illustrate how the accuracy of the transformations may be evaluated.

      Methods (P.7-8):

      “For the initial BigBrain release (Amunts et al., 2013), full BigBrain volumes were resampled to ICBM2009sym (a symmetric MNI152 template) and MNI-ADNI (an older adult T1-weighted template) (Fonov et al., 2011). Registration of BigBrain to ICBM2009sym, known as BigBrainSym, involved a linear then a nonlinear transformation (available on ftp://bigbrain.loris.ca/BigBrainRelease.2015/). The nonlinear transformation was defined by a symmetric diffeomorphic optimiser [SyN algorithm, (Avants et al., 2008)] that maximised the cross-correlation of the BigBrain volume with inverted intensities and a population-averaged T1-weighted map in ICBM2009sym space. The Jacobian determinant of the deformation field illustrates the degree and direction of distortions on the BigBrain volume (Figure 2Ai top).

      A prior study (Xiao et al., 2019) was able to further improve the accuracy of the transformation for subcortical structures and the hippocampus using a two-stage multi-contrast registration. The first stage involved nonlinear registration of BigBrainSym to a PD25 T1-T2 fusion atlas (Xiao et al., 2017, 2015), using manual segmentations of the basal ganglia, red nucleus, thalamus, amygdala, and hippocampus as additional shape priors. Notably, the PD25 T1-T2 fusion contrast is more similar to the BigBrainSym intensity contrast than a T1-weighted image. The second stage involved nonlinear registration of PD25 to ICBM2009sym and ICBM2009asym using multiple contrasts. The deformation fields were made available on Open Science Framework (https://osf.io/xkqb3/). The accuracy of the transformations was evaluated relative to overlap of region labels and alignment of anatomical fiducials (Lau et al., 2019). The two-stage procedure resulted in 0.86-0.97 Dice coefficients for region labels, improving upon direct overlap of BigBrainSym with ICBM2009sym (0.55-0.91 Dice) (Figure 2Aii, 2Aiv top). Transformed anatomical fiducials exhibited 1.77±1.25mm errors, on par with direct overlap of BigBrainSym with ICBM2009sym (1.83±1.47mm) (Figure 2Aiii, 2Aiv below). The maximum misregistration distance (BigBrainSym=6.36mm, Xiao=5.29mm) provides an approximation of the degree of uncertainty in the transformation. In line with this work, BigBrainWarp enables evaluation of novel deformation fields using anatomical fiducials and region labels (evaluate_warps.sh). The script accepts a nonlinear transformation file for registration of BigBrainSym to ICBM2009sym, or vice versa, and returns the Jacobian map, Dice coefficients for labelled regions and landmark misregistration distances for the anatomical fiducials.

      The unique morphology of BigBrain also presents challenges for surface-based transformations. Idiosyncratic gyrification of certain regions of BigBrain, especially the anterior cingulate, causes misregistration (Lewis et al., 2020). Additionally, following inflation to a sphere, the midline representation of BigBrain occupies a disproportionately smaller area than in standard surface templates, owing to differences in surface area, in hemisphere separation methods, and in tessellation methods. To overcome these issues, ongoing work (Lewis et al., 2020) combines a specialised BigBrain surface mesh with multimodal surface matching [MSM; (Robinson et al., 2018, 2014)] to co-register BigBrain to standard surface templates. In the first step, the BigBrain surface meshes were re-tessellated as unstructured meshes with variable vertex density (Möbius and Kobbelt, 2010) to be more compatible with FreeSurfer-generated meshes. Then, coarse-to-fine MSM registration was applied in three stages. First, an affine rotation was applied to the BigBrain sphere, with an additional “nudge” based on an anterior cingulate landmark. Next, nonlinear/discrete alignment was performed using sulcal depth maps (emphasising global scale, Figure 2Biii), followed by nonlinear/discrete alignment using curvature maps (emphasising finer detail, Figure 2Biii). The higher-order MSM procedure implemented for BigBrain maximises concordance of these features while minimising surface deformations in a physically plausible manner, accounting for size and shape distortions (Figure 2Bi) (Knutsen et al., 2010; Robinson et al., 2018). This modified MSMsulc+curv pipeline improves the accuracy of transformed cortical maps (4.38±3.25mm) compared to a standard MSMsulc approach (8.02±7.53mm) (Figure 2Bii-iii) (Lewis et al., 2020).”

      (SEE FIGURE 2 in Response to Reviewer #1)
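      The Jacobian determinant maps referenced in the Methods text above (Figure 2Ai) summarise local volume change under a warp. As a minimal sketch, assuming the deformation is available as a dense NumPy displacement field u of shape (3, X, Y, Z) in millimetres (real ANTs or MINC warp files would first need to be loaded and converted, which is not shown), the determinant of I + ∇u can be computed per voxel with finite differences. Under the mapping x → x + u(x), values above 1 indicate local expansion and values below 1 local contraction. The function name is hypothetical and this is not the evaluate_warps.sh implementation.

```python
import numpy as np


def jacobian_determinant(disp, spacing=(1.0, 1.0, 1.0)):
    """Per-voxel det(I + grad u) for a displacement field disp of shape (3, X, Y, Z)."""
    # grads[i][j] holds the finite-difference derivative d u_i / d x_j
    grads = [np.gradient(disp[i], *spacing) for i in range(3)]
    jac = np.empty(disp.shape[1:] + (3, 3))
    for i in range(3):
        for j in range(3):
            jac[..., i, j] = grads[i][j] + (1.0 if i == j else 0.0)  # add identity
    return np.linalg.det(jac)  # determinant broadcast over all voxels
```

      A zero field gives 1 everywhere, and a uniform 10% stretch along every axis gives 1.1³ ≈ 1.331.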

      I would also recommend more guidance in terms of limitations relating to inter-subject variation. My interpretation of the results of tutorial 3, is that topographic variation of the cortex could easily be driving the greater variation of the frontal parietal networks. Either that, or the Yeo parcel has insufficient granularity; however, in that case any attempt to go to finer MRI driven parcellations - for example to the HCP parcellation, would create its own problems due to subject specific variability.

      We agree that inter-individual variation may contribute to the low predictive accuracy of functional communities by cytoarchitecture. We expanded upon this possibility in the revised Discussion (P. 19) and recommend that future studies examine the uncertainty of subject-specific topographies in concert with uncertainties of transformations.

      “These features depict the vast cytoarchitectural heterogeneity of the cortex and enable evaluation of homogeneity within imaging-based parcellations, for example macroscale functional communities (Yeo et al., 2011). The present analysis showed limited predictability of functional communities by cytoarchitectural profiles, even when accounting for uncertainty at the boundaries (Gordon et al., 2016). [...] Despite all its promises, the singular nature of BigBrain currently prohibits replication and does not capture important inter-individual variation. While large-scale cytoarchitectural patterns are conserved across individuals, the position of boundaries relative to sulci varies, especially in association cortex (Amunts et al., 2020; Fischl et al., 2008; Zilles and Amunts, 2013). This can affect interpretation of BigBrain-MRI comparisons. For instance, in tutorial 3, low predictive accuracy of functional communities by cytoarchitecture may be attributable to subject-specific topographies, which are well established in functional imaging (Benkarim et al., 2020; Braga and Buckner, 2017; Gordon et al., 2017; Kong et al., 2019). Future studies should consider the influence of inter-subject variability in concert with the precision of transformations, as these two elements of uncertainty can impact our interpretations, especially at higher granularity.”

      Reviewer #3:

      The authors make a point for the importance of considering high-resolution, cell-scale, histological knowledge for the analysis and interpretation of low-resolution MRI data. The manuscript describes the aims and relevance of the BigBrain project. The BigBrain is the whole brain of a single individual, sliced at 20 µm and scanned at 1 µm resolution. In recent years, sustained work by the BigBrain team has led to the creation of a precise cell-scale 3D reconstruction of this brain, together with manual and automatic segmentations of different structures. The manuscript introduces a new tool, BigBrainWarp, which consolidates several of the tools used to analyse BigBrain into a single, easy-to-use and well-documented tool. This tool should make it easy for any researcher to use the wealth of information available in the BigBrain for the annotation of their own neuroimaging data. The authors provide three examples of utilisation of BigBrainWarp, and show the way in which this can provide additional insight for analysing and understanding neuroimaging data. The BigBrainWarp tool should have an important impact for neuroimaging research, helping bridge the multi-scale resolution gap, and providing a way for neuroimaging researchers to include cell-scale phenomena in their study of brain data. All data and code are available open source, open access.

      Main concern:

      One of the longstanding debates in the neuroimaging community concerns the relationship between brain geometry (in particular gyral/sulcal anatomy) and the cytoarchitectonic, connective and functional organisation of the brain. There are various examples of correspondence, but also many analyses showing its absence, particularly in associative cortex (for example, Fischl et al. (2008), by some of the co-authors of the present manuscript). The manuscript emphasises the accuracy of their transformations to the different atlas spaces, which may give some readers a false impression. True: towards the end of the manuscript the authors briefly indicate the difficulty of having a single brain as the source of histological data. I think, however, that the manuscript would benefit from making this point more clearly, providing future users of BigBrainWarp with some conceptual elements and references that may help them properly appraise their results. In particular, it would be helpful to briefly describe which aspects of brain organisation were used to drive the deformation to the different templates: whether they were based only on external anatomy, or whether they took into account other aspects such as myelination, thickness, …

      We agree with the Reviewer that the accuracy of the transformation and the potential influence of inter-individual variability should be carefully considered in BigBrain-MRI studies. To highlight these issues in the updated manuscript, we first conducted a quantitative analysis on the accuracy of transformations used in BigBrainWarp (new Figure 2). We provide a function (evaluate_warps.sh) for users to assess accuracy of novel deformation fields and encourage detailed inspection of accuracy estimates and deformation effects for region-of-interest studies. Second, we expanded our discussion of previous research on inter-individual variability and comment on the potential implications of unquantified inter-individual variability for the interpretation of BigBrain-MRI comparisons.

      Methods (P.7-8):

      “A prior study (Xiao et al., 2019) was able to further improve the accuracy of the transformation for subcortical structures and the hippocampus using a two-stage multi-contrast registration. The first stage involved nonlinear registration of BigBrainSym to a PD25 T1-T2 fusion atlas (Xiao et al., 2017, 2015), using manual segmentations of the basal ganglia, red nucleus, thalamus, amygdala, and hippocampus as additional shape priors. Notably, the PD25 T1-T2 fusion contrast is more similar to the BigBrainSym intensity contrast than a T1-weighted image. The second stage involved nonlinear registration of PD25 to ICBM2009sym and ICBM2009asym using multiple contrasts. The deformation fields were made available on Open Science Framework (https://osf.io/xkqb3/). The accuracy of the transformations was evaluated relative to overlap of region labels and alignment of anatomical fiducials (Lau et al., 2019). The two-stage procedure resulted in 0.86-0.97 Dice coefficients for region labels, improving upon direct overlap of BigBrainSym with ICBM2009sym (0.55-0.91 Dice) (Figure 2Aii, 2Aiv top). Transformed anatomical fiducials exhibited 1.77±1.25mm errors, on par with direct overlap of BigBrainSym with ICBM2009sym (1.83±1.47mm) (Figure 2Aiii, 2Aiv below). The maximum misregistration distance (BigBrainSym=6.36mm, Xiao=5.29mm) provides an approximation of the degree of uncertainty in the transformation. In line with this work, BigBrainWarp enables evaluation of novel deformation fields using anatomical fiducials and region labels (evaluate_warps.sh). The script accepts a nonlinear transformation file for registration of BigBrainSym to ICBM2009sym, or vice versa, and returns the Jacobian map, Dice coefficients for labelled regions and landmark misregistration distances for the anatomical fiducials.

      The unique morphology of BigBrain also presents challenges for surface-based transformations. Idiosyncratic gyrification of certain regions of BigBrain, especially the anterior cingulate, causes misregistration (Lewis et al., 2020). Additionally, following inflation to a sphere, the midline representation of BigBrain occupies a disproportionately smaller area than in standard surface templates, owing to differences in surface area, in hemisphere separation methods, and in tessellation methods. To overcome these issues, ongoing work (Lewis et al., 2020) combines a specialised BigBrain surface mesh with multimodal surface matching [MSM; (Robinson et al., 2018, 2014)] to co-register BigBrain to standard surface templates. In the first step, the BigBrain surface meshes were re-tessellated as unstructured meshes with variable vertex density (Möbius and Kobbelt, 2010) to be more compatible with FreeSurfer-generated meshes. Then, coarse-to-fine MSM registration was applied in three stages. First, an affine rotation was applied to the BigBrain sphere, with an additional “nudge” based on an anterior cingulate landmark. Next, nonlinear/discrete alignment was performed using sulcal depth maps (emphasising global scale, Figure 2Biii), followed by nonlinear/discrete alignment using curvature maps (emphasising finer detail, Figure 2Biii). The higher-order MSM procedure implemented for BigBrain maximises concordance of these features while minimising surface deformations in a physically plausible manner, accounting for size and shape distortions (Figure 2Bi) (Knutsen et al., 2010; Robinson et al., 2018). This modified MSMsulc+curv pipeline improves the accuracy of transformed cortical maps (4.38±3.25mm) compared to a standard MSMsulc approach (8.02±7.53mm) (Figure 2Bii-iii) (Lewis et al., 2020).”

      (SEE Figure 2 in response to previous reviewers)
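      The volume evaluation quoted above reports Dice coefficients for labelled regions and misregistration distances for anatomical fiducials. A minimal NumPy sketch of both metrics follows, assuming the two label volumes are already resampled to the same grid and that the fiducials are corresponding (N, 3) coordinate arrays in millimetres. The function names are hypothetical, the standard deviation uses NumPy's default population formula (ddof=0), which may differ from the convention behind the reported ± values, and this is not the code inside evaluate_warps.sh.

```python
import numpy as np


def dice_per_label(a, b, labels):
    """Dice = 2|A ∩ B| / (|A| + |B|) for each label value in two label volumes."""
    out = {}
    for lab in labels:
        ma, mb = a == lab, b == lab
        denom = ma.sum() + mb.sum()
        # NaN when the label is absent from both volumes
        out[lab] = 2.0 * np.logical_and(ma, mb).sum() / denom if denom else float("nan")
    return out


def landmark_summary(moved, fixed):
    """Mean, SD (ddof=0) and maximum Euclidean distance between paired (N, 3) landmarks."""
    d = np.linalg.norm(np.asarray(moved, float) - np.asarray(fixed, float), axis=1)
    return d.mean(), d.std(), d.max()
```

      Reporting the maximum alongside the mean mirrors the text's use of the largest misregistration distance as a rough bound on transformation uncertainty.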

      Discussion (P.18, 19):

      “Cortical folding is variably associated with cytoarchitecture, however. The correspondence of morphology with cytoarchitectonic boundaries is stronger in primary sensory than association cortex (Fischl et al., 2008; Rajkowska and Goldman-Rakic, 1995a, 1995b). Incorporating more anatomical information in the alignment algorithm, such as intracortical myelin or connectivity, may benefit registration, as has been shown in neuroimaging (Orasanu et al., 2016; Robinson et al., 2018; Tardif et al., 2015). Overall, evaluating the accuracy of volume- and surface-based transformations is important for selecting the optimal procedure given a specific research question and to gauge the degree of uncertainty in a registration.”

      “Despite all its promises, the singular nature of BigBrain currently prohibits replication and does not capture important inter-individual variation. While large-scale cytoarchitectural patterns are conserved across individuals, the position of boundaries relative to sulci varies, especially in association cortex (Amunts et al., 2020; Fischl et al., 2008; Zilles and Amunts, 2013). This can have implications for the interpretation of BigBrain-MRI comparisons. For instance, in tutorial 3, low predictive accuracy of functional communities by cytoarchitecture may be attributable to subject-specific topographies, which are well established in functional imaging (Benkarim et al., 2020; Braga and Buckner, 2017; Gordon et al., 2017; Kong et al., 2019). Future studies should consider the influence of inter-subject variability in concert with the precision of transformations, as these two elements of uncertainty can impact our interpretations, especially at higher granularity.”

      Minor:

      1) In the abstract and later in p9 the authors talk about "state-of-the-art" non-linear deformation matrices. This may be confusing for some readers. To me, in brain imaging a matrix is most often a 4x4 affine matrix describing a linear transformation. However, the authors seem to be describing a more complex, non-linear deformation field. Whereas building a deformation matrix (4x4 affine) is not a big challenge, I agree that more sophisticated tools should provide more sophisticated deformation fields. The authors may consider using "deformation field" instead of "deformation matrix", but I leave that to their judgment.

      As suggested, we changed the text to “deformation field” where relevant.
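      To make the reviewer's distinction concrete: a 4×4 affine matrix applies one linear map plus translation to every point, whereas a deformation field stores a separate displacement vector at each voxel and can therefore encode spatially varying, nonlinear warps. The NumPy sketch below is illustrative only; the function names are hypothetical, and it uses nearest-voxel lookup where real registration tools interpolate the field.

```python
import numpy as np


def apply_affine(affine, points):
    """Apply one 4x4 homogeneous matrix to every (N, 3) point."""
    pts = np.asarray(points, float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])  # append w = 1
    return (homog @ affine.T)[:, :3]


def apply_displacement(disp, voxels):
    """Add the per-voxel displacement stored in disp (3, X, Y, Z) to integer voxel coords."""
    vox = np.asarray(voxels, int)
    # each voxel can move differently: this is what an affine cannot express
    return vox + disp[:, vox[:, 0], vox[:, 1], vox[:, 2]].T
```

      In the usage below, only one voxel of the field is displaced, something no single 4×4 matrix could reproduce.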

      2) In the results section, p11, the authors highlight the challenge of segmenting thalamic nuclei or different hippocampal regions, and suggest that this should be simplified by the use of the histological BigBrain data. However, the atlases currently provided in the OSF project do not include these more refined parcellations: there's one single "Thalamus" label, and one single "Hippocampus" label (not really single: left and right). This could be explicitly stated to prevent readers from having too high expectations (although I am certain that those finer parcellations should come in the very near future).

      We updated the text to reflect the current state of such parcellations. While subthalamic nuclei are not yet segmented (to our knowledge), one of the present authors has segmented hippocampal subfields (https://osf.io/bqus3/) and we highlight this in the Results (P.11-12):

      “Despite MRI acquisitions at high and ultra-high fields reaching submillimeter resolutions with ongoing technical advances, certain brain structures and subregions remain difficult to identify (Kulaga-Yoskovitz et al., 2015; Wisse et al., 2017; Yushkevich et al., 2015). For example, there are challenges in reliably defining the subthalamic nucleus (not yet released for BigBrain) or hippocampal Cornu Ammonis subfields [manual segmentation available on BigBrain, https://osf.io/bqus3/, (DeKraker et al., 2019)]. BigBrain-defined labels can be transformed to a standard imaging space for further investigation. Thus, this approach can support exploration of the functional architecture of histologically-defined regions of interest.”

    1. Author Response:

      Reviewer #2 (Public Review):

      Summary:

      Frey et al. develop an automated decoding method, based on convolutional neural networks, for wideband neural activity recordings. This allows the entire neural signal (across all frequency bands) to be used as decoding inputs, as opposed to spike sorting or using specific LFP frequency bands. They show improved decoding accuracy relative to a standard Bayesian decoder, and then demonstrate how their method can find the frequency bands that are important for decoding a given variable. This can help researchers to determine what aspects of the neural signal relate to given variables.

      Impact:

      I think this is a tool that has the potential to be widely useful for neuroscientists as part of their data analysis pipelines. The authors have publicly available code on github and Colab notebooks that make it easy to get started using their method.

      Relation to other methods:

      This paper takes the following three methods used in machine learning and signal processing and combines them in a very useful way. 1) Frequency-based representations based on spectrograms or wavelet decompositions (e.g. Golshan et al., Journal of Neuroscience Methods, 2020; Vilamala et al., 2017 IEEE International Workshop on Machine Learning for Signal Processing). This is used for preprocessing the neural data; 2) Convolutional neural networks (many examples in Livezey and Glaser, Briefings in Bioinformatics, 2020). This is used to predict the decoding output; 3) Permutation feature importance, aka a shuffle analysis (https://scikit-learn.org/stable/modules/permutation_importance.html; https://compstat-lmu.github.io/iml_methods_limitations/pfi.html). This is used to determine which input features are important. I think the authors could slightly improve their discussion/referencing of the connection to the related literature.
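      The third ingredient, permutation feature importance, reduces to a short loop: shuffle one input feature, re-evaluate the trained model, and record how much the error grows. The sketch below is a generic NumPy version assuming a regression model exposed as a predict callable and mean squared error as the loss; it is not the authors' influence implementation, and scikit-learn offers a full version as sklearn.inspection.permutation_importance. The function name shuffle_importance is hypothetical.

```python
import numpy as np


def shuffle_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE when each feature column of X is randomly permuted.

    predict: callable mapping an (n_samples, n_features) array to (n_samples,) predictions.
    """
    rng = np.random.default_rng(seed)
    base = np.mean((predict(X) - y) ** 2)        # error with intact inputs
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # break the feature/target link
            scores[j] += np.mean((predict(Xp) - y) ** 2) - base
    return scores / n_repeats
```

      Because shuffling breaks a feature's link to the target but not its correlation with other features, correlated inputs can share or mask importance, which is the limitation the reviewer raises in concern 1.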

      Overall, I think this paper is a very useful contribution, but I do have a few concerns, as described below.

      We thank the reviewer for the encouraging feedback and the helpful summary of the approaches we used. We are happy to read that they consider the framework to be a very useful contribution to the field of neuroscience. The reviewer raises several important questions regarding the influence measure/feature importance, the data format of the SVM and how the model can be used on EEG/ECoG datasets. Moreover, they suggest clarifying the general overview of the approach and connecting it more closely to the related literature. These are very helpful and thoughtful comments and we are grateful to be given the opportunity to address them.

      Concerns:

      1) The interpretability of the method is not validated in simulations. To trust that this method uncovers the true frequency bands that matter for decoding a variable, I feel it's important to show the method discovers the truth when it is actually known (unlike in neural data). As a simple suggestion, you could take an actual wavelet decomposition, and create a simple linear mapping from a couple of the frequency bands to an imaginary variable; then, see whether your method determines these frequencies are the important ones. Even if the model does not recover the ground truth frequency bands perfectly (e.g. if it says correlated frequency bands matter, which is often a limitation of permutation feature importance), this would be very valuable for readers to be aware of.

      2) It's unclear how much data is needed to accurately recover the frequency bands that matter for decoding, which may be an important consideration for someone wanting to use your method. This could be tested in simulations as described above, and by subsampling from your CA1 recordings to see how the relative influence plots change.

      We thank the reviewer for this really interesting suggestion to validate our model using simulations. Accordingly, we have now trained our model on simulated behaviours, which we created via linear mapping to frequency bands. As shown in Figure 3 - Supplement 2B, the frequency bands modulated by the simulated behaviour can be clearly distinguished from the unmodulated frequency bands. To make the synthetic data more plausible we chose different multipliers (betas) for each frequency component which explains the difference between the peak at 58Hz (beta = 2) and the peak at 3750Hz (beta = 1).

      To generate a more detailed understanding of how the detected influence of a variable changes based on the amount of data available, we conducted an additional analysis. Using the real data, we subsampled the training data from 1 to 35 minutes and fully retrained the model using cross-validation. We then used the original feature importance implementation to calculate influence scores across each cross-validation split. To quantify the similarity between the original influence measure and the downsampled influence, we calculated the Pearson correlation between the downsampled influence and the one obtained when using the full training set. As can be seen in Figure 3 - Supplement 2A, our model achieves an accurate representation of the true influence with as little as 5 minutes of training data (mean Pearson's r = 0.89 ± 0.06).
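      The robustness check described above reduces to one Pearson correlation between two influence vectors per cross-validation split, summarised as mean ± spread. A minimal sketch, assuming each influence result is an array of per-frequency-band scores (flattened if it also has a channel axis) and that the reported ± is a population standard deviation across splits; the function name is hypothetical.

```python
import numpy as np


def influence_similarity(full, downsampled_splits):
    """Mean and SD (ddof=0) of Pearson r between full-data influence and each split."""
    ref = np.ravel(full)
    rs = np.array([np.corrcoef(ref, np.ravel(d))[0, 1] for d in downsampled_splits])
    return rs.mean(), rs.std()
```

      Because Pearson r ignores overall scale, a downsampled model whose influence profile has the right shape but smaller magnitude still scores close to 1.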

      Page 8-9: To further assess the robustness of the influence measure, we conducted two additional analyses. First, we tested how results depended on the amount of training data (1 to 35 minutes; see Methods). We found that our model achieves an accurate representation of the true influence with as little as 5 minutes of training data (mean Pearson's r = 0.89 ± 0.06, Figure 3 - Supplement 2A). Second, we assessed influence accuracy on a simulated behaviour in which we varied the ground truth frequency information (see Methods). The model trained on the simulated behaviour is able to accurately represent the ground truth information (modulated frequencies 58 Hz & 3750 Hz, Figure 3 - Supplement 2B).

      Page 20: To evaluate whether the influence measure accurately captures the true information content, we used simulated behaviours for which the ground-truth information was known. We used the preprocessed wavelet-transformed data from one animal and created a simulated behaviour y_sb from uniform random noise. Two frequency bands were then modulated by the simulated behaviour using f_new = f_old · β · y_sb, with β = 2 for 58 Hz and β = 1 for 3750 Hz. We then retrained the model using five-fold cross-validation and evaluated the influence measure as previously described. We report the proportion of frequency bands that fall into the correct frequencies (i.e. the frequencies we chose to be modulated, 58 Hz & 3750 Hz).
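      A minimal sketch of the modulation f_new = f_old · β · y_sb, assuming the wavelet data are laid out as (time, channels, frequencies), that y_sb is drawn uniformly from [0, 1) once per time step, and that each target frequency maps to the nearest band centre. These choices and the function name are assumptions for illustration only.

```python
import numpy as np

def simulate_behaviour(wavelets, freqs, betas, rng):
    """Modulate selected frequency bands by a random 'behaviour' y_sb.

    wavelets: (n_time, n_channels, n_freqs) wavelet amplitudes (assumed layout).
    freqs: (n_freqs,) band centre frequencies in Hz.
    betas: {frequency_hz: beta}, e.g. {58: 2.0, 3750: 1.0}.
    """
    y_sb = rng.uniform(0.0, 1.0, size=wavelets.shape[0])  # one value per time step
    modulated = wavelets.copy()
    for f_hz, beta in betas.items():
        k = int(np.argmin(np.abs(freqs - f_hz)))  # band closest to the requested frequency
        # f_new = f_old * beta * y_sb, broadcast over channels
        modulated[:, :, k] = wavelets[:, :, k] * beta * y_sb[:, None]
    return modulated, y_sb
```

      The model is then trained to predict y_sb from the modulated array; an accurate influence measure should highlight only the modulated bands.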

      New supplementary Figure:

      Figure 3 - Supplement 2: Decoding influence for downsampled models and simulations. (A) To measure the robustness of the influence measure we downsampled the training data and retrained the model using cross-validation. We plot the Pearson correlation between the original influence distribution using the full training set and the influence distribution obtained from the downsampled data. Each dot shows one cross-validation split. Inset shows influence plots for two runs, one for 35 minutes of training data, the other in which model training consisted of only 5 minutes of training data. (B) We quantified our influence measure using simulated behaviours. We used the wavelet preprocessed data from one CA1 recording and simulated two behavioural variables which were modulated by two frequencies (58Hz & 3750Hz) using different multipliers (betas 2 & 1). We then trained the model using cross-validation and calculated the influence scores via feature shuffling.
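      The caption's "influence scores via feature shuffling" can be illustrated with a generic permutation-importance sketch. It assumes a trained model exposed as a `predict` callable, inputs shaped (samples, channels, frequencies), and mean absolute error as the loss. It is not necessarily identical to the authors' implementation.

```python
import numpy as np

def shuffle_influence(predict, X, y, rng, n_repeats=5):
    """Influence of each frequency band = mean increase in error after shuffling it.

    predict: callable mapping X -> predictions (stands in for the trained model).
    X: (n_samples, n_channels, n_freqs); y: (n_samples,) targets.
    """
    base = np.mean(np.abs(predict(X) - y))
    scores = np.zeros(X.shape[-1])
    for k in range(X.shape[-1]):
        errs = []
        for _ in range(n_repeats):
            Xs = X.copy()
            perm = rng.permutation(X.shape[0])
            Xs[..., k] = X[perm][..., k]  # break the link between band k and the target
            errs.append(np.mean(np.abs(predict(Xs) - y)))
        scores[k] = np.mean(errs) - base
    return scores
```

      Bands that the model does not use produce no error increase when shuffled, so their influence stays at zero.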

      3)

      a) It is not clear why your method leads to an increase in decoding accuracy (Fig. 1). Is this simply because of the preprocessing you are using (using the wavelet coefficients as inputs), or because of your convolutional neural network? Having a control where you provide the wavelet coefficients as inputs to a feedforward neural network would be useful, and a more meaningful comparison than the SVM. Side note: please provide more information on the SVM you are using for comparison (what is the kernel function, are you using regularization?).

      We thank the reviewer for this suggestion and apologize for the lack of documentation regarding the support vector machine model. The support vector machine was indeed trained on the wavelet-transformed data and not on the spike-sorted data, as we wanted a comparison model that also uses the raw data. The high error of the support vector machine on wavelet-transformed data might stem from two problems: (1) by design, the input loses all spatially relevant information, as the 3-D representation (frequencies x channels x time) needs to be flattened into a 1-D vector in order to train an SVM on it, and (2) the SVM therefore needs to deal with a huge number of features. For example, even though the wavelets are downsampled to 30 Hz, one sample still consists of 212992 features (64 timesteps * 128 channels * 26 frequencies), which makes the SVM very slow to train and prone to overfitting the training set.
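      The flattening step and the resulting feature count can be checked directly. The sketch below assumes a (time, channels, frequencies) axis order and row-major flattening. The order is an assumption, but the total size does not depend on it. It also shows how spatial neighbours are pulled apart in the flat vector.

```python
import numpy as np

n_time, n_channels, n_freqs = 64, 128, 26  # one window, sizes taken from the text
window = np.zeros((n_time, n_channels, n_freqs), dtype=np.float32)

features = window.reshape(-1)  # row-major flattening into an SVM input vector
print(features.size)  # 212992

def flat_index(t, c, f):
    """Position of element (t, c, f) in the flat vector: t*3328 + c*26 + f."""
    return (t * n_channels + c) * n_freqs + f

# Neighbouring channels end up 26 positions apart, consecutive time steps 3328 apart
print(flat_index(0, 1, 0) - flat_index(0, 0, 0), flat_index(1, 0, 0) - flat_index(0, 0, 0))  # 26 3328
```

      An SVM treats these positions as unrelated features, so the adjacency of channels and time steps is not available to it.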

      This exact problem would also be present in a feedforward neural network that uses the wavelet coefficients as input. Any hidden layer connected to the input, using a reasonable number of hidden units, will result in a multi-million-parameter model (e.g. 512 units will result in 109051904 weights, excluding biases, for just the first layer). These models are notoriously hard to train and won’t fit on many consumer-grade GPUs, which is why convolutional layers are the preferred, and often the only practical, option for spatially structured signals such as images and other higher-dimensional signals.
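      The parameter arithmetic can be reproduced as below. The convolutional layer is a hypothetical example (3x3 kernels, 26 frequency bands as input channels, 64 filters) chosen only to show the scale of the difference. It is not the architecture used in the manuscript.

```python
n_inputs = 64 * 128 * 26           # flattened wavelet features per sample
hidden_units = 512

dense_weights = n_inputs * hidden_units        # 109051904, the figure quoted in the text
dense_params = dense_weights + hidden_units    # plus one bias per hidden unit

# Hypothetical first conv layer: 3x3 kernels over (time, channel), 26 frequency
# bands as input channels, 64 output filters (sizes chosen for illustration only)
k_t, k_c, in_ch, out_ch = 3, 3, 26, 64
conv_params = k_t * k_c * in_ch * out_ch + out_ch  # weights + biases

print(dense_params, conv_params)  # 109052416 15040
```

      Because convolution shares kernel weights across positions, its parameter count is independent of the number of channels and time steps in the input.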

      We have now included more detailed information about the SVM (including kernel function and regularization parameters) in the methods section of the manuscript.

      Page 19: To generate a further baseline measure of performance when decoding using wavelet-transformed coefficients, we trained support vector machines to decode position from wavelet-transformed CA1 recordings. We used either a linear kernel or a non-linear radial basis function (RBF) kernel to train the model, using a regularization factor of C=100. For the non-linear RBF kernel we set gamma to the default 1 / (num_features * var(X)) as implemented in the sklearn framework. The SVM model was trained on the same wavelet coefficients as the convolutional neural network.
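      A configuration sketch in scikit-learn, assuming 1-D position regression with `SVR`. The text does not state whether regression or classification was used, or how 2-D positions were handled. One regressor per coordinate would be one option. The data below are random toy values at a much smaller size.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n_samples, n_features = 100, 96              # toy sizes; the real input has 212992 features
X = rng.standard_normal((n_samples, n_features))
y = rng.uniform(0.0, 200.0, size=n_samples)  # hypothetical 1-D positions in cm

linear_svm = SVR(kernel="linear", C=100)
rbf_svm = SVR(kernel="rbf", C=100, gamma="scale")  # "scale": 1 / (n_features * X.var())

gamma_used = 1.0 / (n_features * X.var())    # value sklearn derives for gamma="scale"

for model in (linear_svm, rbf_svm):
    model.fit(X, y)
    train_mae = np.mean(np.abs(model.predict(X) - y))  # training error only, for illustration
```

      With the full 212992-dimensional input, every kernel evaluation touches all features, which is consistent with the slow training reported above.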

      b) Relatedly, because the reason for the increase in decoding accuracy is not clear, I don't think you can make the claim that "The high accuracy and efficiency of the model suggest that our model utilizes additional information contained in the LFP as well as from sub-threshold spikes and those that were not successfully clustered." (line 122). Based on the shown evidence, it seems to me that all of the benefits vs. the Bayesian decoder could just be due to the nonlinearities of the convolutional neural network.

      Thanks for raising this interesting point regarding the linear vs. non-linear information contained in the neural data. Indeed, when training the model with linear activation functions for the convolutional and fully connected layers, model performance drops significantly. To quantify this, we ran the model with three alternative configurations of its activation functions: (1) non-linear activation functions only in the convolutional layers, (2) non-linear activation functions only in the fully connected layers, or (3) linear activation functions throughout the whole model. As expected, the model with only linear activation functions performed worst (linear activation functions 61.61cm ± 33.85cm, non-linear convolutional layers 22.99cm ± 18.67cm, non-linear fully connected layers 47.03cm ± 29.61cm, all layers non-linear 18.89cm ± 4.66cm). For comparison, the Bayesian decoder achieves a decoding accuracy of 23.25cm ± 2.79cm on this data.
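      Why the all-linear configuration degrades can be seen from a small numerical check. Stacked layers with identity activations compose into a single affine map. Convolutions are linear too, so this also holds for an all-linear CNN, provided it contains no other non-linear operation such as max-pooling. The random weights below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((16, 8)), rng.standard_normal(16)
W2, b2 = rng.standard_normal((4, 16)), rng.standard_normal(4)
x = rng.standard_normal(8)

# Two stacked layers with identity ("linear") activations...
two_linear = W2 @ (W1 @ x + b1) + b2
# ...equal one affine layer with W = W2 W1 and b = W2 b1 + b2
W, b = W2 @ W1, W2 @ b1 + b2
one_affine = W @ x + b

# With a ReLU between the layers this collapse no longer holds
with_relu = W2 @ np.maximum(W1 @ x + b1, 0.0) + b2
```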

      Thus it appears that the reviewer is correct: the advantage of the CNN model comes in part from the non-linearity of the convolutional layers. The corollary is that there are likely non-linear elements in the neural data that the CNN, but not the Bayesian decoder, can access. However, the CNN also receives wider-band inputs and thus has the potential to utilize information beyond detected spikes alone.

      In response to the reviewer's point and to the new analysis regarding the LFP models raised by reviewer 1, we have now reworded this sentence in the manuscript.

      Page 4: The high accuracy and efficiency of the model for these harder samples suggest that the CNN utilizes additional information from sub-threshold spikes and those that were not successfully clustered, as well as nonlinear information which is not available to the Bayesian decoder.

    1. Author Response

      Reviewer #2 (Public Review):

      Portes et al. investigated the nanoscale architecture and dynamics of the osteoclast sealing zone using high-end microscopy techniques. They first use DONALD 3D single-molecule localization microscopy on osteoclasts seeded on glass to study the lateral and axial localization of key components of the sealing zone. They show that for some components (vinculin, talin C-terminus), the axial localization was higher when molecules were in close proximity to the actin core, while for other components (cortactin, actinin, filamin, paxillin), there was no difference in height as a function of distance from the actin core. They next show that random illumination microscopy (RIM) is well suited to studying the sealing zone of osteoclasts on a bone-mimetic substrate. They continue to use RIM to show that the dynamics of neighbouring podosomes correlate up to a distance of about 1.5 µm. They next show that within the sealing zone, groups of podosomes are surrounded by the classical adhesion adaptor proteins such as vinculin, talin and paxillin, while actinin is present at the periphery of all single cores. This suggests that the sealing zone has an "intermediate" level of organization and that groups of podosomes form a functional unit within the sealing zone. The authors lastly demonstrate that the fluorescence intensity of the cores within these groups correlates with the intensity of the adaptor proteins that surround the group, and that the fluorescence intensities of the cores within one group also correlate with each other.

      Strengths:

      The authors use bone slices to evaluate the nanoscale organization of cytoskeletal components in the sealing zone. Podosome conformations in osteoclasts strongly depend on the substrate type and the usage of bone slices accurately mimics the physiological environment in which osteoclasts reside in vivo.

      The authors use state-of-the-art imaging approaches to evaluate the nanoscale organization and dynamics of multiple podosome components in the sealing zone.

      The identification of groups of podosomes that demonstrate correlated dynamics within the sealing zone is a novel finding that is convincingly demonstrated.

      We thank the reviewer for these encouraging comments and the valuable suggestions below.

      Weaknesses:

      The rationale for the analysis performed on the DONALD super-resolution images (explained in Figure S1) is unclear. The analysis is also not properly explained and it is unclear how the data should be interpreted or put into context. Specific comments related to this analysis:

      • The authors distinguish between the internal and external parts of the cell when reporting the height of the investigated proteins, but it is unclear why this is done. Also, while the authors make this distinction, no conclusions are derived from it, and only the height values towards the internal part of the cell are mentioned in the text.

      As the sealing zone is usually located near the cell periphery, we wondered whether the proximity of the peripheral plasma membrane, and a possible difference in tension between the inner and outer parts, could influence the molecular architecture of the structure; this is why we distinguished between the inner and outer sides of the structures. However, our analyses revealed little difference between these two sides, the most striking being a closer proximity of vinculin to the cores on the outer side of the belt. We now make this explicit in the manuscript (P3, L113-116).

      • It is very much unclear how the distance of the investigated proteins to the actin core is calculated. From Figure S1, it seems that a rectangle centered around a podosome is taken, but the rectangle in the example contains more than one core. This would seem to affect the interpretation of the data presented in the figures that contain the height values. The authors should better explain how the analysis was performed and how it deals with the presence of multiple podosome cores in the rectangle of interest.

      We apologize for this omission. In order not to bias the analysis, the protein distance was calculated for all cores present, not just one. This is now specified in the legend of the figure.

      • In the text, the distance of the proteins with respect to the actin core is given (350nm-710nm depending on the specific protein and localization towards the external or internal part of the cell). It is mentioned that the measurements are not shown but it should be better explained how these numbers were derived from the data and the measurements (average, SD/SEM) should be shown.

      These values correspond to the maxima of the distributions of the different podosome markers shown in Figure 1G. Each of these proteins (vinculin, talin, filamin-A and paxillin) has a broad distribution marked by a depletion at the core, and not a peak as suggested by the first version of the manuscript. We propose not to indicate these values in the revised version in order to simplify the manuscript and not to confuse the reader.

      • Related to the previous comment: while it is mentioned that vinculin, for example, is located at ~500nm from the actin core, the height values (Figure 1E) are binned within 50nm of the core. This does not seem to match. It would be very helpful if the authors would add how many localizations are found so close to the core. Since this number is expected to be low, it would also be valuable if the authors would discuss what this means for the difference in height between the molecules found close to and away from the core.

      Indeed, as shown in Figure 1G, vinculin is much less present in the center of actin cores than at 500 nm from these cores. The graph in Figure 1E, which shows the height of vinculin as a function of the distance to the core without indicating the proportion of molecules detected, can indeed be confusing. That said, a large number of molecules were detected: 197967 for the vinculin graph, including 5973 within 300 nm around the core, which is far from negligible. To facilitate the understanding of this graph, as well as of the graphs corresponding to the heights of the other proteins studied (Figures 1 and S2), we now superimpose the frequency of the localizations on the height distributions (new Figure 1E,F), still compiled in Figure 1G.
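      The combined "height plus localization frequency" profile can be sketched as a binning of localizations by lateral distance to a core. This sketch assumes distance to the nearest of all cores in the region, in line with the distance being computed for all cores present. Bin width, cut-off, and function name are hypothetical. For hundreds of thousands of localizations, a KD-tree nearest-neighbour query would use less memory than the full distance matrix used here.

```python
import numpy as np

def height_profile(locs_xy, heights, cores_xy, bin_nm=50.0, max_nm=1000.0):
    """Localization count and mean height per bin of lateral distance to the nearest core."""
    locs_xy = np.asarray(locs_xy, dtype=float)
    heights = np.asarray(heights, dtype=float)
    cores_xy = np.asarray(cores_xy, dtype=float)
    # Distance from every localization to every core, then keep the nearest core
    d = np.linalg.norm(locs_xy[:, None, :] - cores_xy[None, :, :], axis=-1).min(axis=1)
    edges = np.arange(0.0, max_nm + bin_nm, bin_nm)
    n_bins = len(edges) - 1
    idx = np.digitize(d, edges) - 1          # bin index; d >= 0 so idx >= 0
    keep = idx < n_bins                      # drop localizations beyond max_nm
    counts = np.bincount(idx[keep], minlength=n_bins)
    sums = np.bincount(idx[keep], weights=heights[keep], minlength=n_bins)
    mean_height = np.full(n_bins, np.nan)
    np.divide(sums, counts, out=mean_height, where=counts > 0)
    return edges, counts, mean_height
```

      Plotting `counts` alongside `mean_height` makes it visible when a mean height rests on few localizations.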

      • For cortactin, filamin A and actinin it is found that they reside on average at a height of approximately 150nm, even up to a large distance from the podosome core. It is unclear how these values should be interpreted. 150nm is way above the location where actin is expected to be (and also way above the average actin height that is found by the authors, with approximately 80nm more distant from the cores). The authors should add a discussion of what type of structures cortactin, filamin A and actinin would associate to at this position or how this height can be explained. This should also be included in the final model of Figure 6. In the current cartoon, filamin A for example seems to be associated with the integrins but this does not match with the height position observed by the authors.

      The average heights of cortactin, filamin-A and actinin are indeed around 150 nm, but are actually present over a wider range of heights (0-400nm), as shown in the histograms in Figure 1H. These values are therefore not inconsistent with the distribution of actin, which indeed has a lower average height, but is also present over this entire height (histogram now added in Figure 1H). These analyses suggest that there are different sets of actin filaments and that there is proportionally more cortactin, filamin-A and actinin on the high actin filaments, rather than on those close to the plasma membranes. To fully account for these results, we now point out the potential presence of different sets of actin filaments in the discussion (P7, L266-275) and corrected the model shown in the new Figure 6, placing a population of filamin A on the radial filaments, not just associated with integrins, and added filamin A and actinin in the side view of the model, to appreciate their likely localisation.

      The authors mention that the RIM resolution is 100nm and 300nm in the lateral and axial direction, respectively. This should also be confirmed on the bone slices with beads. It is well conceivable that the optical properties of bone have an effect on the optimal RIM resolution.

      In order to evaluate RIM resolution on osteoclast samples, as suggested by the reviewer, we performed experiments with beads and used the Fourier Ring Correlation (FRC) method (Nieuwenhuizen et al., Nat Methods 2013). This consists of acquiring two RIM images with two different speckle illumination sequences and comparing the correlation of the images in Fourier space. The following figure shows the correlation curve as a function of spatial frequency. The FIRE number, obtained from the spatial frequency at which the FRC curve drops to a correlation value of 1/7, gives an estimate of the image resolution.
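      A bare-bones NumPy sketch of the FRC computation with the 1/7 threshold, assuming two square images of the same field. It omits the edge apodization, ring smoothing, and threshold-crossing interpolation that published FRC implementations add, so it should be read as an illustration of the principle only.

```python
import numpy as np

def frc_curve(img1, img2):
    """Fourier ring correlation of two square images, one value per integer ring radius."""
    n = img1.shape[0]
    f1 = np.fft.fftshift(np.fft.fft2(img1))
    f2 = np.fft.fftshift(np.fft.fft2(img2))
    yy, xx = np.indices((n, n)) - n // 2          # frequency coordinates, zero at centre
    ring = np.round(np.hypot(xx, yy)).astype(int).ravel()
    num = np.bincount(ring, weights=np.real(f1 * np.conj(f2)).ravel())
    p1 = np.bincount(ring, weights=(np.abs(f1) ** 2).ravel())
    p2 = np.bincount(ring, weights=(np.abs(f2) ** 2).ravel())
    n_rings = n // 2                              # keep rings up to Nyquist only
    return num[:n_rings] / np.sqrt(p1[:n_rings] * p2[:n_rings])

def frc_resolution(img1, img2, pixel_nm, threshold=1 / 7):
    """Resolution (nm) at the first ring where the FRC drops below the threshold, else None."""
    frc = frc_curve(img1, img2)
    spatial_freq = np.arange(len(frc)) / (img1.shape[0] * pixel_nm)  # cycles per nm
    below = np.nonzero(frc[1:] < threshold)[0]    # skip the zero-frequency ring
    if below.size == 0:
        return None
    return 1.0 / spatial_freq[below[0] + 1]
```

      Identical images give an FRC of 1 at every ring, whereas independent noise decorrelates at high spatial frequencies. The crossing point therefore marks where the two acquisitions stop sharing structure.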

      Using this approach, we estimated the resolution to be 125 nm on average.

      The authors find three specific fluctuation periods (100s/25s/7s) but it is unclear what these periods mean. The authors only very briefly mention that these periods correlate with similar observations in macrophages but they should also add the implications of this finding and suggested a possible molecular mechanism that underlies these different fluctuations.

      We agree with this comment. So far, the mechanisms regulating these oscillations, whether purely mechanical or involving signaling, as well as their importance for podosome and sealing zone function, are not understood. In van den Dries et al. Nat Commun 2013 and Labernadie et al. Nat Commun 2014, it was shown that these oscillations in macrophage podosomes depend on myosin IIA activity. It would thus be interesting to explore the effects of drugs interfering with actin polymerization on both the periodicity and the spatial synchrony properties of the sealing zone. We now discuss this point in the manuscript (P7, L296-300).

      The authors find that actinin-1 is localized around the podosome cores while filamin and vinculin surround groups of podosomes. The representative images chosen to support this difference, though, display very different densities of podosome cores. The filamin and vinculin images seem to have a much denser podosome content than the actinin and cortactin images. I would encourage the authors to select more comparable images so that the difference in localization of the associated proteins can be fully appreciated.

      This is a good point. Indeed, not all sealing zones are alike, especially with respect to the density of actin cores. This is why we have chosen to show a gallery of different cases (now in Figure S7), and not to intentionally select always the same patterns in the main figures in order not to mislead the reader. It is important to note that whatever the actin density, we find the same locations for the different proteins.

      In Figures 4 and 5, the authors show that the sealing zone is subdivided into groups of podosomes, and it is implied that these form functional units within the sealing zone. Yet it is unclear how persistent these groups are. Considering the dynamic nature of podosomes in other cell types (as also demonstrated in the supplementary movies), it is well conceivable that these groups continuously fuse and remodel. To better define the nature of these groups of podosomes, the authors should add an analysis of these podosome groups and measure parameters such as group stability, podosome number per group, group size, etc. This would very much enhance the novel aspects of the findings in this paper.

      Following the reviewer’s suggestion, we have quantified the number of podosomes per group and the group size. Measurements of these islets of clustered cores showed that they covered 2.3 ± 2.1 µm² (mean ± SD) and contained 7 ± 8 cores (mean ± SD). These results are now included in the manuscript (P6, L213). Unfortunately, we could not accurately measure the stability of the clusters, as this would require long and challenging RIM time-lapse imaging of osteoclasts expressing both paxillin-GFP and lifeact-mCherry, which we were able to achieve only on a few cells and on short timescales.

      The authors mention in the discussion that their finding about the groups of podosomes is very different from the "double circle" distribution found in previous publications. Yet, it is unclear what explains these different observations. While the authors use RIM super-resolution in this paper to assess the localization of the adaptor proteins, it is very unlikely that this is the source of this difference since the groups of podosomes would have been easily identified by conventional or confocal microscopy as well. The authors should add an extended discussion on how these differences could be explained and what this means for bone resorption properties.

      Indeed, our observation that the sealing zone is composed of islets of actin cores bordered by a network of adhesion complexes diverges from most previous studies describing a “double circle” organization. We believe that this difference may come not only from the high resolution of our images but mainly from the fact that most studies on the organization of sealing zones have been performed on mouse osteoclasts. We also believe that this particular organisation probably allows efficient sealing of the osteoclast plasma membrane to the bone surface and maintains the resorption lacuna and the diffusion barrier. We now indicate this in the discussion (P7, L286-288).

    1. Author response:

      Reviewer #1 (Public Review):

      How does the brain respond to inputs of different complexity, and does this ability to respond change with age?

      The study by Lalwani et al. tried to address this question by pulling together a number of neuroscientific methodologies (fMRI, MRS, drug challenge, perceptual psychophysics). A major strength of the paper is that it is backed up by robust sample sizes and careful choices in data analysis, translating into a more rigorous understanding of the sensory input as well as the neural metric. The authors apply a novel analysis method, developed for human resting-state MRI data, to task-based data in the visual cortex, specifically investigating the variability of the neural response to stimuli of different levels of visual complexity. A subset of participants took part in a placebo-controlled drug challenge and functional neuroimaging. This experiment showed that increases in GABA have differential effects on participants with different baseline levels of GABA in the visual cortex, possibly modulating the perceptual performance in those with lower baseline GABA. A caveat is that no single cohort has taken part in all study elements, i.e. visual discrimination with drug challenge and neuroimaging. Hence the causal relationship is limited to the neural variability measure and does not extend to visual performance. Nevertheless, the consistent use of visual stimuli across approaches permits an exceptionally high level of comparability across modalities (the computational, behavioural, and fMRI analyses all draw on the same set of images). The conclusions that can be made on such a coherent data set are strong.

      The community will benefit from the technical advances, esp. the calculation of BOLD variability, in the study when described appropriately, encouraging further linkage between complementary measures of brain activity, neurochemistry, and signal processing.

      Thank you for your review. We agree that a future study with a single cohort would be an excellent follow-up.

      Reviewer #2 (Public Review):

      Lalwani et al. measured BOLD variability during the viewing of houses and faces in groups of young and old healthy adults and measured ventrovisual cortex GABA+ at rest using MR spectroscopy. The influence of the GABA-A agonist lorazepam on BOLD variability during task performance was also assessed, and baseline GABA+ levels were considered as a mediating variable. The relationship of local GABA to changes in variability in BOLD signal, and how both properties change with age, are important and interesting questions. The authors feature the following results: 1) younger adults exhibit greater task-dependent changes in BOLD variability and higher resting visual cortical GABA+ content than older adults, 2) greater BOLD variability scales with GABA+ levels across the combined age groups, 3) administration of a GABA-A agonist increased condition differences in BOLD variability in individuals with lower baseline GABA+ levels but decreased condition differences in BOLD variability in individuals with higher baseline GABA+ levels, and 4) resting GABA+ levels correlated with a measure of visual sensory ability derived from a set of discrimination tasks that incorporated a variety of stimulus categories.

      Strengths of the study design include the pharmacological manipulation for gauging a possible causal relationship between GABA activity and task-related adjustments in BOLD variability. The consideration of baseline GABA+ levels for interpreting this relationship is particularly valuable. The assessment of feature-richness across multiple visual stimulus categories provided support for the use of a single visual sensory factor score to examine individual differences in behavioral performance relative to age, GABA, and BOLD measurements.

      Weaknesses of the study include the absence of an interpretation of the physiological mechanisms that contribute to variability in BOLD signal, particularly for the chosen contrast that compared viewing houses with viewing faces.

      Whether any of the observed effects can be explained by patterns in mean BOLD signal, independent of variability would be useful to know.

      One of the first pre-processing steps of computing SDBOLD involves subtracting the block-mean from the fMRI signal for each task-condition. Therefore, patterns observed in BOLD signal variability are not driven by mean-BOLD differences. Moreover, as noted above, to further confirm this, we performed an additional mean-BOLD-based analysis (see Supplementary Materials, p. 3). Results suggest that ∆⃗ MEANBOLD is actually larger in older adults vs. younger adults (∆⃗ SDBOLD exhibited the opposite pattern), but more importantly ∆⃗ MEANBOLD is not correlated with GABA or with visual performance. This is also consistent with prior research (Garrett et al. 2011, 2013, 2015, 2020) that found MEANBOLD to be relatively insensitive to behavioral performance.
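      The block-mean subtraction can be sketched for a single voxel as follows. Each condition is assumed to be stored as a list of per-block arrays. The sign convention (house minus face) and the use of ddof=1 are assumptions of this sketch, not details taken from the manuscript.

```python
import numpy as np

def sd_bold(blocks):
    """SD of the concatenated, block-mean-centred time series for one condition.

    blocks: list of 1-D arrays, one voxel's signal in each block of the condition.
    """
    centred = [b - b.mean() for b in blocks]  # removes the block-wise mean BOLD
    return np.concatenate(centred).std(ddof=1)

def delta_sd_bold(house_blocks, face_blocks):
    # Condition difference, assumed here to be house minus face
    return sd_bold(house_blocks) - sd_bold(face_blocks)
```

      Adding any constant offset to a block leaves `sd_bold` unchanged, which is the sense in which block-wise mean differences cannot drive the variability estimate.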

      The positive correlation between resting GABA+ levels and the task-condition effect on BOLD variability reaches significance at the total group level, when the young and old groups are combined, but not separately within each group. This correlation may be explained by age-related differences since younger adults had higher values than older adults for both types of measurements. This is not to suggest that the relationship is not meaningful or interesting, but that it may be conceptualized differently than presented.

      Thank you for this important point. The relationship between GABA and ∆⃗ SDBOLD shown in Figure 3 is also significant within each age-group separately (Line 386-388). The model used both age-group and GABA as predictors of ∆⃗ SDBOLD and found that both had a significant effect, while the Age-group x GABA interaction was not significant. The effect of age on ∆⃗ SDBOLD therefore does not completely explain the observed relationship between GABA and ∆⃗ SDBOLD because this latter effect is significant in both age-groups individually and in the whole sample even when variance explained by age is accounted for. The revision clarifies this important point (Ln 488-492). Thanks for raising it.
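      The model with age-group, GABA and their interaction as predictors of ∆⃗ SDBOLD can be written as an ordinary least-squares fit. The sketch below uses `numpy.linalg.lstsq` and returns point estimates only. It assumes a 0/1 coding for age group. Significance tests would require a statistics package, and the authors' exact model specification (e.g. covariates, centring) is not given here.

```python
import numpy as np

def fit_interaction_model(delta_sd, gaba, old):
    """OLS: delta_sd ~ b0 + b1*old + b2*gaba + b3*old*gaba.

    old: 0 for younger, 1 for older adults (assumed coding).
    Returns coefficients [b0, b1, b2, b3].
    """
    gaba = np.asarray(gaba, dtype=float)
    old = np.asarray(old, dtype=float)
    X = np.column_stack([np.ones_like(gaba), old, gaba, old * gaba])
    coef, *_ = np.linalg.lstsq(X, np.asarray(delta_sd, dtype=float), rcond=None)
    return coef
```

      A GABA effect (b2) that remains non-zero while age group (b1) is in the model is what distinguishes a GABA relationship from a pure age-group difference. A near-zero interaction (b3) means the slope is similar in both groups.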

      Two separate dosages of lorazepam were used across individuals, but the details of why and how this was done are not provided, and the possible effects of the dose are not considered.

      Good point. We utilized two dosages to maximize our chances of finding a dosage that had a robust effect. The specific dosage was randomly assigned across participants and the dosage did not differ across age-groups or baseline GABA levels. We also controlled for the drug-dosage when examining the role of drug-related shift in ∆⃗ SDBOLD. We have clarified these points in the revision and highlighted the analysis that found no effect of dosage on drug-related shift in ∆⃗ SDBOLD (Line 407-418).

      The observation of greater BOLD variability during the viewing of houses than faces may be specific to these two behavioral conditions, and lingering questions about whether these effects generalize to other types of visual stimuli, or other non-visual behaviors, in old and young adults, limit the generalizability of the immediate findings.

      We agree that examining the factors that influence BOLD variability is an important topic for future research. In particular, although it is increasingly well known that variability modulation itself can occur in a host of different tasks and research contexts across the lifespan (see Garrett et al., 2013; Waschke et al., 2021), to address the question of whether variability modulation occurs directly in response to stimulus complexity in general, it will be important for future work to examine a range of stimulus categories beyond faces and houses. Doing so is indeed an active area of research in Dr. Garrett’s group, where visual stimuli from many different categories are examined (e.g., for a recent approach, see Waschke et al., 2023, bioRxiv). Regardless, only face and house stimuli were available in the current dataset. We therefore exploited the finding that BOLD variability tends to be larger for house stimuli than for face stimuli (in line with the HMAX model output) to demonstrate that the degree to which a given individual modulates BOLD variability in response to stimulus category is related to their age, to GABA levels, and to behavioral performance.

      The observed age-related differences in patterns of BOLD activity and ventrovisual cortex GABA+ levels along with the investigation of GABA-agonist effects in the context of baseline GABA+ levels are particularly valuable to the field, and merit follow-up. Assessing background neurochemical levels is generally important for understanding individualized drug effects. Therefore, the data are particularly useful in the fields of aging, neuroimaging, and vision research.

      Thank you, we agree!

      Reviewer #3 (Public Review):

      The role of neural variability in various cognitive functions is one of the focal contentions in systems and computational neuroscience. In this study, the authors used a large-scale cohort dataset to investigate the relationship between neural variability measured by fMRI and several factors, including stimulus complexity, GABA levels, aging, and visual performance. Such investigations are valuable because neural variability, although an important topic, has so far mostly been studied in animal neurophysiology. There is little evidence in humans. Also, the conclusions are built on a large-scale cohort dataset that includes multimodal data. Such a dataset per se is a big advantage. Pharmacological manipulations and MRS acquisitions are rare in this line of research. Overall, I think this study is well-designed, and the manuscript reads well. I listed my comments below and hope my suggestions can further improve the paper.

      Strength:

      1) The study design is astonishingly rich. The authors used task-based fMRI, MRS technique, population contrast (aging vs. control), and psychophysical testing. I appreciate the motivation and efforts for collecting such a rich dataset.

      2) The MRS part is good. I am not an expert in MRS so cannot comment on MRS data acquisition and analyses. But I think linking neural variability to GABA in humans is in general a good idea. There has been a long interest in the cause of neural variability, and inhibition of local neural circuits has been hypothesized as one of the key factors.

      3) The pharmacological manipulation is particularly interesting as it provides at least some evidence for the causal effects of GABA on deltaSDBOLD. I think this is quite novel.

      Weakness:

      1) I am concerned about the definition of neural variability. In electrophysiological studies, neural variability can be defined as Poisson-like spike count variability. In the fMRI world, however, there is no consensus on what neural variability is. There are at least three definitions. One is the variability (e.g., std) of the voxel response time series, as used here and in the resting fMRI world. The second is to regress out the stimulus-evoked activation and only calculate the std of the residuals (e.g., background variability). The third is to calculate the trial-by-trial variability of beta estimates from general linear modeling. The relations between these three types of variability and other factors currently remain unclear. The links between neuronal variability and voxel variability also remain unclear. I don't think the computational principles discovered in neuronal variability also apply to voxel responses. I hope the authors can acknowledge and discuss these differences.

      These are very important points; thank you for raising them. Although we agree that the majority of the single-cell electrophysiology community does seem to prefer Poisson-like spiking variability as an easy and tractable estimate, it is certainly not the only variability approach in that field (e.g., entropy; see our most recent work in humans, where spiking entropy outperforms simple spike counts in predicting memory performance; Waschke et al., 2023, bioRxiv). In LFP, EEG/MEG and fMRI, there is indeed no singular consensus on what variability “is”, and in our opinion, that is a good thing. We have reported at length in past work on entire families of measures of signal variability, from simple variance, to power, to entropy, and beyond (see Table 1 in Waschke et al., 2021, Neuron). In principle, these measures are quite complementary, obviating the need to establish any single-measure consensus per se. Rather than viewing the three measures of neural variability that the reviewer mentioned as competing definitions, we prefer to view them as different sources of variance. For example, from each of the three sources of variance the reviewer suggests, any number of variability measures could be computed.

      The current study focuses on the standard deviation of concatenated blocked time series, computed separately for face and house viewing conditions (the same estimation approach used in our very earliest studies on signal variability; Garrett et al., 2010, JNeurosci). In those early studies, and in nearly every one thereafter (see Waschke et al., 2021, Neuron), there is no ostensible link between SDBOLD (as we normally compute it) and average BOLD from either multivariate or GLM models; as such, in past work we have not found any clear difference in SDBOLD results whether or not average “evoked” responses are removed. This may also be why removing ERPs from EEG time series rarely influences estimates of variability in our work (e.g., Kloosterman et al., 2020, eLife).

      The third definition the reviewer notes refers to variability of beta estimates over trials. Our most recent work has done exactly this (e.g., Skowron et al., 2023, bioRxiv), calculating the SD even over single time point-wise beta estimates so that we may better control the extraction of time points prior to variability estimation. Although direct comparisons have not yet been published by us, variability over single TR beta estimates and variability over the time series without beta estimation are very highly correlated in our work (in the .80 range; e.g., Kloosterman et al., in prep).
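      To make the three sources of variance discussed above concrete, the following Python sketch computes each of them on a simulated voxel time series. It is a minimal illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, regressors are unconvolved boxcars (no HRF), single-trial betas come from one GLM with one regressor per trial (only one of several common approaches), and the data are simulated.

```python
import numpy as np

def sd_raw(ts):
    """(1) SD of the voxel time series itself (e.g., concatenated task blocks)."""
    return float(np.std(ts, ddof=1))

def _ols(ts, regressors):
    """Ordinary least squares with an intercept column prepended."""
    X = np.column_stack([np.ones(len(ts)), regressors])
    beta, *_ = np.linalg.lstsq(X, ts, rcond=None)
    return X, beta

def sd_residual(ts, evoked):
    """(2) SD of residuals after regressing out the stimulus-evoked regressor(s)."""
    X, beta = _ols(ts, evoked)
    return float(np.std(ts - X @ beta, ddof=1))

def sd_trialwise_betas(ts, trial_design):
    """(3) SD across trials of single-trial beta estimates (one regressor per trial)."""
    _, beta = _ols(ts, trial_design)
    return float(np.std(beta[1:], ddof=1))  # beta[0] is the intercept

# Simulated data: 10 trials of 8 TRs each (4 TRs "on", 4 TRs rest); no HRF convolution
rng = np.random.default_rng(0)
n_trials, on_trs, trial_len = 10, 4, 8
n = n_trials * trial_len
trial_design = np.zeros((n, n_trials))
for k in range(n_trials):
    trial_design[k * trial_len : k * trial_len + on_trs, k] = 1.0
evoked = trial_design.sum(axis=1)                        # single condition regressor
amplitudes = 1.0 + 0.5 * rng.standard_normal(n_trials)   # trial-to-trial response variability
ts = trial_design @ amplitudes + 0.3 * rng.standard_normal(n)

print(sd_raw(ts), sd_residual(ts, evoked), sd_trialwise_betas(ts, trial_design))
```

      Because the residual model always includes an intercept, definition (2) can never exceed definition (1) on the same data; how strongly (1) and (3) agree in real data is an empirical question that this toy example does not address.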

      Re: the reviewer’s point that “It also remains unclear the links between neuronal variability and voxel variability. I don’t think the computational principles discovered in neuronal variability also apply to voxel responses. I hope the authors can acknowledge their differences and discuss their differences.” If we understand correctly, the reviewer may be asking about within-person links between single-cell neuronal variability (which would allow Poisson-like spiking variability to be estimated) and voxel variability in fMRI? To our knowledge, no such study has been conducted to date (such data barely exist). Or perhaps the reviewer is making a more general point regarding the “computational principles” of variability in these different domains? If so, a few points are worth noting. First, there is absolutely no expectation of Poisson distributions in continuous brain imaging-based time series (LFP, E/MEG, fMRI). To our knowledge, such distributions (which have equal means and variances, allowing, e.g., Fano factors to be estimated) are mathematically possible in spiking because of the binary nature of spikes; when mean rates rise, so too do variances, given that activity pushes away from the floor (of no activity). In continuous time signals, there is no effective “zero”, so a mathematical floor does not exist outright. This is likely why means and variances are not well coupled in continuous time signals (see Garrett et al., 2013, NBR; Waschke et al., 2021, Neuron); anything can happen. Regardless, convergence is beginning to emerge between the effects noted for spiking and for continuous time estimates of variability. For example, we show that spiking variability can show a similar, behaviourally relevant coupling to the complexity of visual input (Waschke et al., 2023, bioRxiv) as seen in the current study and in past work (e.g., Garrett et al., 2020, NeuroImage). Whether such convergence reflects common computational principles of variability remains to be seen in future work, despite known associations between single-cell recordings and BOLD overall (e.g., Logothetis and colleagues, 2001, 2002, 2004, 2008).
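      The contrast drawn above between spike counts and continuous signals can be illustrated with a short simulation. This Python sketch uses simulated data only (no real recordings) and a hypothetical `fano_factor` helper: for Poisson counts the variance tracks the mean, so the Fano factor stays near 1 as the rate changes, whereas for a continuous signal an additive baseline shift changes the mean while leaving the variance untouched, so mean and variance need not be coupled.

```python
import numpy as np

def fano_factor(x):
    """Variance-to-mean ratio (hypothetical helper; ddof=1 sample variance)."""
    x = np.asarray(x, dtype=float)
    return float(x.var(ddof=1) / x.mean())

rng = np.random.default_rng(1)

# Spike counts (Poisson): variance tracks the mean, so the Fano factor stays near 1
for rate in (2.0, 10.0, 50.0):
    counts = rng.poisson(rate, size=20_000)
    print(f"Poisson rate {rate:4.0f}: mean={counts.mean():6.2f} "
          f"var={counts.var(ddof=1):6.2f} FF={fano_factor(counts):.2f}")

# Continuous signal: an arbitrary baseline offset changes the mean, not the variance
signal = rng.normal(0.0, 1.0, size=20_000)
for offset in (5.0, 50.0):
    shifted = signal + offset
    print(f"offset {offset:4.0f}: mean={shifted.mean():6.2f} "
          f"var={shifted.var(ddof=1):6.2f} FF={fano_factor(shifted):.3f}")
```

      For the continuous signal, the Fano factor depends entirely on the arbitrary offset, which is one way to see why Poisson-based summaries do not transfer directly to BOLD or E/MEG time series.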

      Given the intricacies of these arguments, we have not included this discussion in the revised text. However, we would be happy to include aspects of this content in the main paper if the reviewer sees fit.

      2) If I understand correctly, the positive relationship between stimulus complexity and voxel variability has already been reported in the authors' previous work. Thus, the claims in the abstract (lines 14-15) and in section 1 of the Results are overstated; these results largely replicate the findings of the previous work. This should be clearly stated.

      Good point. Since this finding was a replication and an extension, we reported these results mostly in the supplementary materials. The stimulus set used for the current study differs from that of Garrett et al. (2020), and therefore a replication is important. Moreover, we have extended these findings across young and older adults (previous work was based on older adults alone). We have modified the text to clarify which results are a replication and which are extensions or novel to the current study (Lines 14, 345 and 467). Thanks for the suggestion.

      3) It is difficult for me to comprehend the U-shaped account of baseline GABA and the shift in deltaSDBOLD. If deltaSDBOLD per se is good, as evidenced by the positive relationship between brainscore and visual sensitivity shown in Fig. 5b and the discussion in lines 432-440, why should the brain decrease deltaSDBOLD? Or did I miss something? I understand that "average is good, outliers are bad", but a more detailed theory is needed to account for such effects.

      When GABA levels are increased beyond optimal levels, neuronal firing rates are reduced, effectively dampening neural activity and limiting dynamic range; in the present study, this resulted in reduced ∆⃗ SDBOLD. Thus, the observed drug-related decrease in ∆⃗ SDBOLD was most pronounced in participants with already high levels of GABA. We have now added an explanation for the expected inverted-U (Lines 523-546). The following figure illustrates this with a hypothetical curve and shows how different parts of Fig 4 might map onto different points on such a curve.

      Author response image 1.

      Line 523-546 – “We found in humans that the drug-related shift in ∆⃗ SDBOLD could be either positive or negative, while being negatively related to baseline GABA. Thus, boosting GABA activity with the drug during visual processing in participants with lower baseline GABA levels and low levels of ∆⃗ SDBOLD resulted in an increase in ∆⃗ SDBOLD (i.e., a positive change in ∆⃗ SDBOLD on drug compared to off drug). However, in participants with higher baseline GABA levels and higher ∆⃗ SDBOLD, when GABA was increased presumably beyond optimal levels, participants experienced no change or even a decrease in ∆⃗ SDBOLD on drug. These findings thus provide the first evidence in humans for an inverted-U account of how GABA may link to variability modulation.

      Boosting low GABA levels in older adults helps increase ∆⃗ SDBOLD, but why does increasing GABA levels lead to reduced ∆⃗ SDBOLD in others? One explanation is that higher-than-optimal levels of inhibition in a neuronal system can dampen the entire network. Reduced neuronal firing decreases the number of states the network can visit and the dynamic range of the network. Indeed, some anesthetics work by increasing GABA activity (for example, propofol, a general anesthetic, modulates activity at GABAA receptors), and GABA is known for its sedative properties. Previous research showed that propofol leads to a steeper power spectral slope (a measure of the “construction” of signal variance) in monkey ECoG recordings (Gao et al., 2017). Networks function optimally only when dynamics are stabilized by sufficient, but not excessive, inhibition. Thus, there is an inverted-U relationship between ∆⃗ SDBOLD and GABA that is similar to that observed with other neurotransmitters.”
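      One way to see why an inverted-U predicts a drug-related shift that is negatively related to baseline GABA, and that can change sign, is a toy derivation. It assumes (this is not the authors' model) a quadratic relation between GABA level $g$ and variability modulation $S$, with optimum $g^{*}$ and curvature $k$, and a drug that adds the same increment $\delta > 0$ to every participant's GABA level:

```latex
\begin{align*}
S(g) &= S_{\max} - k\,(g - g^{*})^{2}, \qquad k > 0 \\
\Delta S(g) &= S(g + \delta) - S(g) \\
  &= -k\left[(g + \delta - g^{*})^{2} - (g - g^{*})^{2}\right] \\
  &= -2k\delta\,(g - g^{*}) - k\delta^{2} \\
\frac{\partial\,\Delta S}{\partial g} &= -2k\delta < 0,
  \qquad \Delta S(g) > 0 \iff g < g^{*} - \tfrac{\delta}{2}
\end{align*}
```

      Under these assumptions, participants below roughly $g^{*} - \delta/2$ show an increase and those above it a decrease, matching the qualitative pattern described in the quoted text; the real curve need not be quadratic, and the drug effect need not be additive.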

      4) Related to the 3rd question, can you show the relationship between the shift of deltaSDBOLD (i.e., the delta of deltaSDBOLD) and visual performance?

      We did not have data on visual performance from the same participants who completed the drug-based part of the study (Subset 1 vs 3; see Figure 1); therefore, we unfortunately cannot directly investigate the relationship between the drug-related shift in ∆⃗ SDBOLD and visual performance. We have now highlighted this as a limitation of the current study (Lines 589-592), where we state: “One limitation of the current study is that participants who received the drug manipulation did not complete the visual discrimination task; thus, we could not directly assess how the drug-related change in ∆⃗ SDBOLD impacted visual performance.”

      5) Is the dataset openly available? I didn't find a data availability statement.

      An Excel sheet containing all processed data needed to reproduce the figures and results, together with a data dictionary describing each column, has been included in the source data submitted with the manuscript. The raw MRI, MRS and fMRI data used in the current manuscript were collected as part of a larger (MIND) study and will be made publicly available on completion of the study (around 2027). Until then, the raw data can be obtained for research purposes upon reasonable request. Processing code will be made available on GitHub.

    1. Author Response:

      Reviewer #1:

      Salehinejad et al. run a battery of tests to investigate the effects of sleep deprivation on cortical excitability using TMS, LTP/LTD-like plasticity using tDCS, EEG-derived measures, and behavioral task performance. The study confirms evidence for sleep deprivation resulting in an increase in cortical excitability, diminished LTP-like plasticity, an increase in EEG theta band power, and worse task performance. Additionally, a protocol that usually results in LTD-like plasticity produces LTP-like changes in the sleep deprivation condition.

      We appreciate the reviewer's time in carefully reading our work and providing important suggestions and recommendations. In what follows, we address the comments one by one; we have revised the main text accordingly and pasted the changes here as well.

      1) My main comment concerns the motivation for this specific study setup, which did not become clear to me. It is a robust experimental design, with a general approach quite similar to that of the Kuhn et al. 2016 study (heavily cited in the current manuscript), which investigates cortical excitability, EEG markers, and changes in LTP mechanisms, with the additional inclusion of LTD-plasticity measures. The authors list comprehensiveness as a motivation, but the power of a comprehensive study like this would lie in being able to make comparisons across measures to identify new interrelations or interesting subgroups of participants differentially affected by sleep deprivation. These comparisons are presented in l. 322 and otherwise at the end of the supplementary material, and the study does not seem to have been designed with these as the main motivation. Could the authors comment on this and clarify their motivation? Perhaps the authors can highlight in what way their study constitutes a methodological improvement and incorporates new aspects regarding hypothesis development compared with, e.g., Kuhn et al. 2016; currently, the authors mainly highlight the addition of LTD-plasticity protocols. Similarly, no motivation, context, or hypotheses are given for the saliva testing. There are many different results, but, e.g., the cortical excitability results are not discussed in depth: there is no effect on the IO curve, but there are effects on other measures of excitability, and the conclusion of that paragraph is only "our results demonstrate that corticocortical and corticospinal excitability are upscaled after sleep deprivation." There are some conflicting results regarding cortical excitability measures in the literature; these could be discussed so that the reader can evaluate in what way the current study constitutes an improvement, for instance methodologically, over previous studies.

      Thank you for your comment and suggestion. The main motivation behind this study was to examine different physiological, behavioral, and cognitive measures under different sleep conditions and to provide a reasonably complete overview. This approach was not covered in detail by previous work, which is often limited to one or two pieces of behavioral and/or physiological evidence. Our study was not sufficiently powered to identify new interrelations between measures, because this was a secondary aim, although we found some relevant associations in exploratory analyses (i.e., associations of motor learning with plasticity, and of cortical excitability with memory and attention). Future studies that are sufficiently powered for these comparisons are needed to explore interrelations between physiological and cognitive parameters more clearly, and we have stated this as a limitation (Page 22).

      That said, we agree that specific rationales of the study were not sufficiently clarified in the previous version. We rephrased and clarified respective motivations and rationales here:

      1) By comprehensive, we mean that we obtained measures ranging from basic physiological parameters to behavior and higher-order cognition, which have not been sufficiently covered so far. This also includes the exploration of expected associations between behavioral motor learning and plasticity measures, as well as between excitability parameters and cognitive functions.

      2) In the Kuhn et al. (2016) study, cortical excitability was obtained via the TMS intensity (single-pulse protocol) needed to elicit a predefined motor-evoked potential amplitude, which is a relatively unspecific parameter of corticospinal excitability. In the present study, cortical excitability was monitored with different TMS protocols, which cover not only corticospinal excitability but also intracortical inhibition, facilitation, I-wave facilitation, and short-latency afferent inhibition, allowing more specific conclusions with respect to the involvement of cortical systems, neurotransmitters, and -modulators.

      3) Furthermore, Kuhn et al. (2016) only investigated LTP-like, but not LTD-like, plasticity. To the best of our knowledge, LTD-like plasticity has not been investigated in previous work either. LTD-like plasticity is, however, relevant for cognitive processing; furthermore, knowledge about alterations of this kind of plasticity is important for a mechanistic understanding of sleep-dependent plasticity alterations: the conversion of LTD-like to LTP-like plasticity under sleep deprivation is crucial for interpreting the study results as likely caused by cortical hyperactivity.

      4) Finally, an important motivation was to compare how brain physiology and cognition are affected by sleep deprivation as opposed to chronotype-dependent brain physiology and cognitive performance, especially at non-preferred times of day. Our findings regarding the latter were recently published (Salehinejad et al., 2021), and comparing the present study with the published one has novel and important implications. Specifically, the results of both studies imply that the mechanisms underlying reduced performance after sleep deprivation and at a non-optimal time of day differ substantially.

      We clarified these motivations in the introduction and discussion. Please see the revised text below:

      "The number of available studies about the impact of sleep deprivation on human brain physiology relevant for cognitive processes is limited, and knowledge is incomplete. With respect to cortical excitability, Kuhn et al. (2016) showed increased excitability under sleep deprivation via a global measure of corticospinal excitability, the TMS intensity needed to induce motor-evoked potentials of a specific amplitude. Specific information about the cortical systems, including neurotransmitters and -modulators involved in these effects (e.g., glutamatergic, GABAergic, cholinergic), is however missing. The level of cortical excitability affects neuroplasticity, a relevant physiological substrate of learning and memory formation. Kuhn and co-workers (2016) accordingly describe a sleep deprivation-dependent alteration of LTP-like plasticity in humans. The effects of sleep deprivation on LTD-like plasticity, which are required for a complete picture, have however not been explored so far. In the present study, we aimed to complete current knowledge and also explored cognitive performance on tasks that critically depend on cortical excitability (working memory and attention) and neuroplasticity (motor learning), to gain mechanistic knowledge about sleep deprivation-dependent performance decline. Finally, we aimed to explore whether the impact of sleep deprivation on brain physiology and cognitive performance differs from the effects of non-optimal time-of-day performance in different chronotypes, which we recently explored in a parallel study with an identical experimental design (Salehinejad et al., 2021). The use of measures of different modalities in this study allows us to comprehensively investigate the impact of sleep deprivation on brain and cognitive functions, which is largely missing in the human literature."

      We added more details about the rationale for saliva sampling:

      "We also assessed resting-EEG theta/alpha, as an indirect measure of homeostatic sleep pressure, and examined cortisol and melatonin concentration to see how these are affected under sleep conditions, given the reported mixed effects in previous studies."

      We also rephrased the cortical excitability results. Please see the revised text below:

      "Taken together, our results demonstrate that glutamate-related intracortical excitability is upscaled after sleep deprivation. Moreover, cortical inhibition was decreased or turned into facilitation, which is indicative of enhanced cortical excitability as a result of GABAergic reduction. Corticospinal excitability showed only a trendwise upscaling, indicating a major contribution of cortical, rather than downstream, excitability to this sleep deprivation-related enhancement."

      "The increase of cortical excitability parameters and the resultant synaptic saturation following sleep deprivation can explain the respective cognitive performance decline. It is, however, worth noting that our study was not powered to identify these correlations with sufficient reliability, and future studies that are powered for this aim are needed.

      Our findings have several implications. First, they show that sleep and circadian preference (i.e., chronotype) have functionally different impacts on human brain physiology and cognition. The same parameters of brain physiology and cognition were recently investigated at circadian optimal vs non-optimal time of day in two groups of early and late chronotypes (Salehinejad et al., 2021). While we found decreased cortical facilitation and lower neuroplasticity induction (same for both LTP and LTD) at the circadian nonpreferred time in that study (Salehinejad et al., 2021), in the present study we observed upscaled cortical excitability and a functionally different pattern of neuroplasticity alteration (i.e., diminished LTP-like plasticity induction and conversion of LTD- to LTP-like plasticity)."

      2) EEG measures. In general, I find the presented evidence for a link between synaptic strength and human theta power weak. In humans, rhythmic theta activity is found mostly in the form of midfrontal theta. Here, the largest changes seem to be at posterior electrodes (judging from the bottom row of Fig. 4), which will not capture rhythmic midfrontal theta in humans. Can the authors explain the scaling of the top vs. bottom row of Fig. 4? There seems to be a mismatch, and no legend is given for the bottom row. The activity captured here is probably related to changes in nonrhythmic 1/f-type activity (which displays large changes related to arousal; e.g., https://elifesciences.org/articles/55092). It would be beneficial to see a power spectrum for the EEG measures, to see the specific type of power changes across all frequencies and to verify that these are actually oscillatory peaks in individual subjects. As far as I understood, the referenced study Vyazovskiy et al., 2008 contains no information regarding theta as a marker of synaptic potentiation. The evidence that synaptic strength is captured by the specific measures used needs to be strengthened, or statements like "measured synaptic strength via the resting-EEG theta/alpha pattern" need to be stated more carefully.

      Thank you for this comment. We removed the Pz electrode from the figure and instead added F3 and F4, along with Fz and Cz, to capture more midfrontal regions. Please see the revised Figure 4. The top rows now include only midfrontal and midcentral areas (Fz, Cz, F3, F4) and show numerical comparisons of midfrontal theta, which differs significantly across conditions (and is larger after sleep deprivation). The purpose of the bottom figures, which have now been removed, was only to provide an overall visual comparison of theta distribution across sleep conditions. However, we agree that the bottom-row figures are misleading because they capture average theta band power without specifying midfrontal regions. We removed this part of the figure to prevent confusion. Please see below.

      Regarding the power spectrum, we also added a new panel (Fig. 4g) showing how different frequency bands of the power spectrum are affected by sleep deprivation. Please see the revised Figure 4 below.

      Updated results, pages 12-13:

      "In line with this, we investigated how sleep deprivation affects resting-state brain oscillations at the theta band (4-7 Hz), the beta band (15-30 Hz) as another marker of cortical excitability, vigilance and arousal (Eoh et al., 2005; Fischer et al., 2008) and the alpha band (8-14 Hz) which is important for cognition (e.g. memory, attention) (Klimesch, 2012). To this end, we analyzed EEG spectral power at mid-frontocentral electrodes (Fz, Cz, F3, F4) using a 4×2 mixed ANOVA. For theta activity, significant main effects of location (F1.71=18.68, p<0.001; ηp2=0.40) and sleep condition (F1=17.82, p<0.001; ηp2=0.39), but no interaction was observed, indicating that theta oscillations at frontocentral regions were similarly affected by sleep deprivation. Post hoc tests (paired, p<0.05) revealed that theta oscillations, grand averaged at mid-central electrodes, were significantly increased after sleep deprivation (p<0.001) (Fig. 4a,b). For the alpha band, the main effects of location (F1.49=12.92, p<0.001; ηp2=0.31) and sleep condition (F1=5.03, p=0.033; ηp2=0.15) and their interaction (F2.31=4.60, p=0.010; ηp2=0.14) were significant. Alpha oscillations, grand averaged at mid-frontocentral electrodes, were significantly decreased after sleep deprivation (p=0.033) (Fig. 4c,d). Finally, the analysis of beta spectral power showed significant main effects of location (F1.34=6.73, p=0.008; ηp2=0.19) and sleep condition (F1=6.98, p=0.013; ηp2=0.20) but no significant interaction. Beta oscillations, grand averaged at mid-frontocentral electrodes, were significantly increased after sleep deprivation (p=0.013) (Fig. 4e,f)."

      Fig. 4. Resting-state theta, alpha, and beta oscillations at electrodes Fz, Cz, F3 and F4. a,b Theta band activity was significantly higher after the sleep deprivation vs sufficient sleep condition (tFz=4.61, p<0.001; tCz=2.22, p=0.034; tF3=2.93, p=0.007; tF4=4.78, p<0.001). c,d, Alpha band activity was significantly lower at electrodes Fz and Cz (tFz=2.39, p=0.023; tCz=2.65, p=0.013) after the sleep deprivation vs the sufficient sleep condition. e,f, Beta band activity was significantly higher at electrodes Fz, Cz and F4 after sleep deprivation compared with the sufficient sleep condition (tFz=3.06, p=0.005; tCz=2.38, p= 0.024; tF4=2.25, p=0.032). g, Power spectrum including theta (4-7 Hz), alpha (8-14 Hz), and beta (15-30 Hz) bands at the electrodes Fz, Cz, F3 and F4 respectively. Data of one participant were excluded due to excessive noise. All pairwise comparisons for each electrode were calculated via post hoc Student’s t-tests (paired, p<0.05). n=29. Error bars represent s.e.m. ns = nonsignificant; Asterisks indicate significant differences. Boxes indicate the interquartile range that contains 50% of values (range from the 25th to the 75th percentile) and whiskers show the 1 to 99 percentiles.
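      As an illustration of how power in the theta (4-7 Hz), alpha (8-14 Hz), and beta (15-30 Hz) bands could be extracted from a single resting EEG channel, here is a minimal Python sketch. It is not the authors' pipeline: it uses a single Hann-windowed periodogram rather than, e.g., Welch averaging, omits artifact rejection and referencing, treats band edges as inclusive, and runs on simulated signals; `spectrum` and `band_powers` are hypothetical helpers. Keeping the full spectrum alongside the band sums is what allows one to check whether a band change reflects an oscillatory peak or a broadband (1/f-type) shift.

```python
import numpy as np

# Band edges (Hz) as given in the revised Results; inclusive edges are an assumption
BANDS = {"theta": (4.0, 7.0), "alpha": (8.0, 14.0), "beta": (15.0, 30.0)}

def spectrum(x, fs):
    """Hann-windowed periodogram of one channel (density scaling, one-sided doubling omitted)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                      # remove DC offset
    win = np.hanning(x.size)
    psd = np.abs(np.fft.rfft(x * win)) ** 2 / (fs * np.sum(win ** 2))
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    return freqs, psd

def band_powers(freqs, psd, bands=BANDS):
    """Integrate the spectrum within each band (rectangle rule)."""
    df = freqs[1] - freqs[0]
    return {name: float(psd[(freqs >= lo) & (freqs <= hi)].sum() * df)
            for name, (lo, hi) in bands.items()}

# Toy example: two simulated 60-s recordings at 250 Hz (not real data)
fs = 250.0
t = np.arange(0.0, 60.0, 1.0 / fs)
rng = np.random.default_rng(2)
rested = np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 6 * t) + rng.normal(0, 1, t.size)
deprived = 0.7 * np.sin(2 * np.pi * 10 * t) + np.sin(2 * np.pi * 6 * t) + rng.normal(0, 1, t.size)

for label, x in (("rested", rested), ("deprived", deprived)):
    f, p = spectrum(x, fs)
    print(label, band_powers(f, p))
```

      Because the omitted one-sided factor is a constant, it cancels when band powers are compared between conditions, which is all the toy comparison above does.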

      Regarding the reference, we unfortunately cited the wrong work from the Vyazovskiy team; we meant Vyazovskiy et al. (2005). We removed this reference and the part of the introduction that needed to be toned down, and added new relevant references while toning down the statement about synaptic strength. Please see below:

      Revised text, Results, page 12:

      "So far, we found that sleep deprivation upscales cortical excitability, prevents induction of LTP-like plasticity, presumably due to saturated synaptic potentiation, and converts LTD- into LTP-like plasticity. Previous studies in animals (Vyazovskiy and Tobler, 2005; Leemburg et al., 2010) and humans (Finelli et al., 2000) have shown that EEG theta activity is a marker for homeostatic sleep pressure and increased cortical excitability (Kuhn et al., 2016)."

      3) In general, the authors do a good job of indicating which tests were corrected for multiple comparisons. In some cases, however, e.g., for the correlational analyses across measures, significant results are reported without a clear statement of what other tests were computed and how correction was applied, which makes the strength of the evidence hard to evaluate. Please check all presented correlations.

      Thank you for your comment. For correlational analyses, no correction for multiple comparisons was computed, because these were secondary exploratory analyses. We now state this clearly in the manuscript. For the other analyses, the description of multiple comparisons is included below:

      Methods, pages 35-37:

      "For the TMS protocols with a double-pulse condition (i.e., SICI-ICF, I-wave facilitation, SAI), the resulting mean values were normalized to the respective single-pulse condition. First, mean values were calculated individually, and then inter-individual means were calculated for each condition. For the I-O curves, absolute MEP values were used. To test for statistical significance, repeated-measures ANOVAs were performed with ISIs, TMS intensity (in the I-O curve only), and condition (sufficient sleep vs sleep deprivation) as within-subject factors and MEP amplitude as the dependent variable. In case of significant ANOVA results, post hoc comparisons were performed using Bonferroni-corrected t-tests to compare mean MEP amplitudes of each condition against the baseline MEP and to contrast sufficient sleep vs sleep deprivation conditions. To determine whether individual baseline measures differed within and between sessions, SI1mV and baseline MEP were entered as dependent variables in a mixed-model ANOVA with session (4 levels) and condition (sufficient sleep vs sleep deprivation) as within-subject factors, and group (anodal vs cathodal) as a between-subject factor. The mean MEP amplitude for each measurement time point was normalized to the session’s baseline (the individual quotient of the time-point mean and the baseline mean), resulting in values representing either increased (> 1.0) or decreased (< 1.0) excitability. Individual averages of the normalized MEP from each time point were then calculated and entered as dependent variables in a mixed-model ANOVA with repeated measures, with stimulation condition (active, sham), time point (8 levels), and sleep condition (normal vs deprivation) as within-subject factors and group (anodal vs cathodal) as a between-subject factor. In case of significant ANOVA results, post hoc comparisons of MEP amplitudes at each time point were performed using Bonferroni-corrected t-tests to examine whether active stimulation resulted in a significant difference relative to sham (comparison 1), baseline (comparison 2), the respective stimulation condition under sufficient sleep vs sleep deprivation (comparison 3), and the between-group comparisons at respective time points (comparison 4).

      The mean RT, RT variability and accuracy of blocks were entered as dependent variables in repeated-measures ANOVAs with block (5 vs 6, 6 vs 7) and condition (sufficient sleep vs sleep deprivation) as within-subject factors. Because the RT differences between blocks 5 vs 6 and 6 vs 7 were of major interest, post hoc comparisons were performed on RT differences between these blocks using paired-sample t-tests (two-tailed, p<0.05) without correction for multiple comparisons. For the 3-back, Stroop and AX-CPT tasks, the mean and standard deviation of RT and accuracy were calculated and entered as dependent variables in repeated-measures ANOVAs with sleep condition (sufficient sleep vs sleep deprivation) as the within-subject factor. For significant ANOVA results, post hoc comparisons of dependent variables were performed using paired-sample t-tests (two-tailed, p<0.05) without correction for multiple comparisons.

      For the resting-state data, brain oscillations at mid-frontocentral electrodes (Fz, Cz, F3, F4) were analyzed with a 4×2 ANOVA with location (Fz, Cz, F3, F4) and sleep condition (sufficient sleep vs sleep deprivation) as the within-subject factors. For all tasks, individual ERP means were grand-averaged and entered as dependent variables in repeated-measures ANOVAs with sleep condition (sufficient sleep vs sleep deprivation) as the within-subject factor. Post hoc comparisons of grand-averaged amplitudes were performed using paired-sample t-tests (two-tailed, p<0.05) without correction for multiple comparisons.

      To assess the relationship between induced neuroplasticity and motor sequence learning, and the relationship between cortical excitability and cognitive task performance, we calculated Pearson correlations. For the first correlation, we used individual grand-averaged MEP amplitudes obtained from anodal and cathodal tDCS, pooled over the time points between 0 and 20 min after the interventions, and individual motor learning performance (i.e., BL6-5 and BL6-7 RT differences) across sleep conditions. For the second correlation, we used individual grand-averaged MEP amplitudes obtained from each TMS protocol and individual accuracy/RT obtained from each task across sleep conditions. No correction for multiple comparisons was done for correlational analyses, as these were secondary exploratory analyses."
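      The two normalization steps described in the Methods above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code: the function names, the dictionary layout keyed by time point, and the example amplitudes are hypothetical, and the subsequent ANOVAs, Bonferroni-corrected t-tests, and correlations are not shown.

```python
import numpy as np

def normalize_paired_pulse(conditioned_meps, single_pulse_meps):
    """Double-pulse protocols (e.g., SICI-ICF, SAI): mean conditioned MEP / mean single-pulse MEP."""
    return float(np.mean(conditioned_meps) / np.mean(single_pulse_meps))

def normalize_to_baseline(meps_by_timepoint, baseline_meps):
    """tDCS after-effects: mean MEP at each time point divided by the session's baseline mean.

    Values > 1.0 indicate increased and values < 1.0 decreased excitability.
    """
    baseline_mean = float(np.mean(baseline_meps))
    return {tp: float(np.mean(meps)) / baseline_mean
            for tp, meps in meps_by_timepoint.items()}

# Hypothetical amplitudes (mV) for one participant and session
baseline = [1.0, 1.1, 0.9]
after_tdcs = {"0 min": [1.4, 1.6], "20 min": [0.7, 0.9]}
print(normalize_to_baseline(after_tdcs, baseline))
print(normalize_paired_pulse([0.5, 0.7], [1.2, 1.2]))
```

      Using a ratio rather than a difference makes 1.0 the no-change reference regardless of each participant's absolute MEP size, which is what makes the normalized values comparable across participants and sessions.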

      There are also inconsistencies such as: "The average levels of cortisol and melatonin were lower after sleep deprivation vs sufficient sleep (cortisol: 3.51±2.20 vs 4.85±3.23, p=0.05; melatonin 10.50±10.66 vs 16.07±14.94, p=0.16)"

      The p-values are not significant here?

      Thank you for your comment. The p-value was only marginally significant for the cortisol level changes. We clarified this in the revision. Please see below:

      Revised text, page 19:

      "The average levels of cortisol and melatonin were numerically lower after sleep deprivation vs sufficient sleep (cortisol: 3.51±2.20 vs 4.85±3.23, p=0.056; melatonin 10.50±10.66 vs 16.07±14.94, p=0.16), but these differences were only marginally significant for the cortisol level and showed only a trendwise reduction for melatonin."

      Reviewer #2:

      This study represents the most comprehensive characterization to date of indices of synaptic plasticity and cognition in humans in the context of sleep deprivation. It provides further support for an interplay between the time course of synaptic strength/cortical excitability (homeostatic plasticity) and the inducibility of associative synaptic LTP-/LTD-like plasticity. The study is of great interest, the translation of the findings is of potential clinical relevance, the methods appear to be solid, and the results are mostly convincing. I believe that the writing of the manuscript should be improved (e.g. quality of referencing, a clearer framework and hypothesis, fewer redundancies, and a more precise discussion). However, all of these points can be addressed, since the overall concept, design, conduct and findings are convincing and of great interest to the field of sleep research, and more broadly to the neurosciences, to clinicians and to the public.

      We thank the reviewer for carefully reading our work and for providing important suggestions and recommendations.

    1. Author Response

      Reviewer #1 (Public Review):

      The manuscript by Lujan and colleagues describes a series of cellular phenotypes associated with the depletion of TANGO2, a poorly characterized gene product that is nonetheless relevant to neurological and muscular disorders. The authors report that TANGO2 associates with membrane-bound organelles, mainly mitochondria, and affects lipid metabolism and the accumulation of reactive oxygen species. Based on these observations, the authors speculate that TANGO2 functions in acyl-CoA metabolism.

      The observations are generally convincing and most of the conclusions appear logical. While the function of TANGO2 remains unclear, the finding that it interferes with lipid metabolism is novel and important. However, this observation was not developed to a great extent, and, based on the data presented, the link between TANGO2 and acyl-CoA proposed by the authors appears rather speculative.

      We thank you for your advice and now include additional data that lends support to the role of TANGO2 in lipid metabolism. We have changed the title accordingly.

      1) The data with overexpressed TANGO2 looks convincing but I wonder if the authors analyzed the localization of endogenous TANGO2 by immunofluorescence using the antibody described in Figure S2. The idea that TANGO2 localizes to membrane contact sites between mitochondria and the ER and LDs would also be strengthened by experiments including multiple organelle markers.

      We agree that most of the data on TANGO2 localization are based on overexpression of the protein. As suggested by the reviewer, and despite the lack of commercial antibodies for immunofluorescence-based evaluation (see the following chart), we tested the commercial antibody described in Figure 2 on HepG2 and U2OS cells. Moreover, we used Förster resonance energy transfer (FRET) to analyze the proximity of TANGO2 and Tom20, a specific outer mitochondrial membrane protein. In addition, we visualized cells expressing tagged TANGO2 together with either tagged VAP-B, an integral ER protein in the mitochondria-associated membranes (doi:10.1093/hmg/ddr559), or tagged GPAT4-Hairpin, an integral LD protein (doi:10.1016/j.devcel.2013.01.013). These data strengthen our proposal and are presented in the revised manuscript.

      As suggested by the reviewer, we have also visualized two additional cell lines (HepG2 and U2OS) with the anti-TANGO2 antibody (Novus Biologicals) that was used for western blotting (see chart above). As shown in the following figure, the commercial antibody produces considerable staining in addition to mitochondria, especially in U2OS cells, where it also appears to label the nucleus.

      2) The changes in LD size in TANGO2-depleted cells are very interesting and consistent with the role of TANGO2 in lipid metabolism. From the lipidomics analysis, it seems that the relative levels of the main neutral lipids in TANGO2-depleted cells remain unaltered (TAG) or even decrease (CE). Therefore, it would be interesting to explore further the increase in LD size for example analyze/display the absolute levels of neutral lipids in the various conditions.

      We agree with the reviewer and now present the absolute levels of lipids of interest in the various conditions of the lipidomics analyses (Figure S3).

      3) Most of the lipidomics changes in TANGO2-depleted cells are observed in lipid species present in very low amounts, while the relative abundance of major phospholipids (PC, PE, PI) remains mostly unchanged. It would be good to also display the absolute levels of the various lipids analyzed. This is an important point to clarify, as it would be unlikely that these major phospholipids are unaffected by an overall defect in acyl-CoA metabolism, as proposed by the authors.

      As stated above, we have now included the absolute levels of lipids of interest in the various conditions of the lipidomics analyses (Figure S3).

    1. Author Response:

      Reviewer #1 (Public Review):

      In this article, Bollmann and colleagues demonstrated both theoretically and experimentally that blood vessels could be targeted at the mesoscopic scale with time-of-flight magnetic resonance imaging (TOF-MRI). With a mathematical model that includes partial voluming effects explicitly, they outline how small voxels reduce the dependency of blood dwell time, a key parameter of the TOF sequence, on blood velocity. Through several experiments on three human subjects, they show that increasing resolution improves contrast and evaluate additional issues such as vessel displacement artifacts and the separation of veins and arteries.

      The overall presentation of the main finding, that small voxels are beneficial for mesoscopic pial vessels, is clear and well discussed, although difficult to grasp fully without a good prior understanding of the underlying TOF-MRI sequence principles. Results are convincing, and some of the data, both raw and processed, have been provided publicly. Visual inspection and comparisons of different scans are provided, although no quantification or statistical comparison of the results is included.

      Potential applications of the study are varied, from modeling more precisely functional MRI signals to assessing the health of small vessels. Overall, this article reopens a window on studying the vasculature of the human brain in great detail, for which studies have been surprisingly limited until recently.

      In summary, this article provides a clear demonstration that small pial vessels can indeed be imaged successfully with extremely high voxel resolution. There are however several concerns with the current manuscript, hopefully addressable within the study.

      Thank you very much for this encouraging review. While smaller voxel sizes theoretically benefit all blood vessels, we are specifically targeting the (small) pial arteries here, as the inflow-effect in veins is unreliable and susceptibility-based contrasts are much more suited for this part of the vasculature. (We have clarified this in the revised manuscript by substituting ‘vessel’ with ‘artery’ wherever appropriate.) Using a partial-volume model and a relative contrast formulation, we find that the blood delivery time is not the limiting factor when imaging pial arteries, but the voxel size is. Taking into account the comparatively fast blood velocities even in pial arteries with diameters ≤ 200 µm (using t_delivery=l_voxel/v_blood), we find that blood dwell times are sufficiently long for the small voxel sizes considered here to employ the simpler formulation of the flow-related enhancement effect. In other words, small voxels eliminate blood dwell time as a consideration for the blood velocities expected for pial arteries.

      We have extended the description of the TOF-MRA sequence in the revised manuscript, and all data and simulations/analyses presented in this manuscript are now publicly available at https://osf.io/nr6gc/ and https://gitlab.com/SaskiaB/pialvesseltof.git, respectively. This includes additional quantifications of the FRE effect for large vessels (adding to the assessment for small vessels already included), and the effect of voxel size on vessel segmentations.

      Main points:

      1) The manuscript needs clarifying through some additional background information for a readership wider than expert MR physicists. The TOF-MRA sequence and its underlying principles should be introduced first thing, even before discussing vascular anatomy, as it is the key to understanding what aspects of blood physiology and MRI parameters matter here. MR physics shorthand terms should be avoided or defined, as 'spins' or 'relaxation' are not obvious to everybody. The relationship between delivery time and slab thickness should be made clear as well.

      Thank you for this valuable comment that the Theory section is perhaps not accessible for all readers. We have adapted the manuscript in several locations to provide more background information and details on time-of-flight contrast. We found, however, that there is no concise way to first present the MR physics part and then introduce the pial arterial vasculature, as the optimization presented therein is targeted towards this structure. To address this comment, we have therefore opted to provide a brief introduction to TOF-MRA first in the Introduction, and then a more in-depth description in the Theory section.

      Introduction section:

      "Recent studies have shown the potential of time-of-flight (TOF)-based magnetic resonance angiography (MRA) at 7 Tesla (T) in subcortical areas (Bouvy et al., 2016, 2014; Ladd, 2007; Mattern et al., 2018; Schulz et al., 2016; von Morze et al., 2007). In brief, TOF-MRA uses the high signal intensity caused by inflowing water protons in the blood to generate contrast, rather than an exogenous contrast agent. By adjusting the imaging parameters of a gradient-recalled echo (GRE) sequence, namely the repetition time (TR) and flip angle, the signal from static tissue in the background can be suppressed, and high image intensities are only present in blood vessels freshly filled with non-saturated inflowing blood. As the blood flows through the vasculature within the imaging volume, its signal intensity slowly decreases. (For a comprehensive introduction to the principles of MRA, see for example Carr and Carroll (2012).) At ultra-high field, the increased signal-to-noise ratio (SNR), the longer T1 relaxation times of blood and grey matter, and the potential for higher resolution are key benefits (von Morze et al., 2007)."

      Theory section:

      "Flow-related enhancement

      Before discussing the effects of vessel size, we briefly revisit the fundamental theory of the flow-related enhancement effect used in TOF-MRA. Taking into account the specific properties of pial arteries, we will then extend the classical description to this new regime. In general, TOF-MRA creates high signal intensities in arteries using inflowing blood as an endogenous contrast agent. The object magnetization—created through the interaction between the quantum mechanical spins of water protons and the magnetic field—provides the signal source (or magnetization) accessed via excitation with radiofrequency (RF) waves (called RF pulses) and the reception of ‘echo’ signals emitted by the sample around the same frequency. The T1-contrast in TOF-MRA is based on the difference in the steady-state magnetization of static tissue, which is continuously saturated by RF pulses during the imaging, and the increased or enhanced longitudinal magnetization of inflowing blood water spins, which have experienced no or few RF pulses. In other words, in TOF-MRA we see enhancement for blood that flows into the imaging volume."
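      The saturation argument can be made concrete with the textbook single-compartment description of an ideal spoiled GRE, sketched below. Inflowing blood starts fully relaxed and loses longitudinal magnetization with each RF pulse. It approaches a steady state, whereas static tissue is assumed to already be in its steady state. This is not the authors' partial-volume model. Perfect spoiling and equal proton density (M0 = 1) in blood and tissue are our assumptions. The TR, flip angle and T1 values in the usage loop are illustrative only. The number of pulses experienced is taken as the blood delivery time divided by TR.

```python
import math


def steady_state_mz(tr_ms, t1_ms, flip_deg, m0=1.0):
    """Steady-state longitudinal magnetization of an ideally spoiled GRE."""
    e1 = math.exp(-tr_ms / t1_ms)
    return m0 * (1.0 - e1) / (1.0 - e1 * math.cos(math.radians(flip_deg)))


def mz_after_pulses(n, tr_ms, t1_ms, flip_deg, m0=1.0):
    """Longitudinal magnetization just before the (n+1)-th RF pulse for spins
    that start fully relaxed (ideal spoiling, instantaneous pulses)."""
    e1 = math.exp(-tr_ms / t1_ms)
    c = e1 * math.cos(math.radians(flip_deg))
    m_ss = steady_state_mz(tr_ms, t1_ms, flip_deg, m0)
    # The deviation from steady state shrinks by a factor c every TR.
    return m_ss + (m0 - m_ss) * c ** n


def relative_fre(t_delivery_ms, tr_ms, flip_deg, t1_blood_ms, t1_tissue_ms):
    """(M_blood - M_tissue) / M_tissue; the common sin(flip) signal factor cancels."""
    n_pulses = int(t_delivery_ms // tr_ms)
    m_blood = mz_after_pulses(n_pulses, tr_ms, t1_blood_ms, flip_deg)
    m_tissue = steady_state_mz(tr_ms, t1_tissue_ms, flip_deg)
    return (m_blood - m_tissue) / m_tissue


# Illustrative values only (not the study protocol): TR = 20 ms, flip = 18 deg,
# T1 blood = 2100 ms, T1 tissue = 1900 ms.
for t_delivery in (100, 500, 1000):
    print(t_delivery, round(relative_fre(t_delivery, 20.0, 18.0, 2100.0, 1900.0), 2))
```

      With equal M0, a longer T1 actually gives blood a lower steady state than tissue. In this simple picture, the enhancement therefore depends entirely on blood arriving before it has experienced many pulses, which is why short delivery times matter.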

      "Since the coverage or slab thickness in TOF-MRA is usually kept small to minimize blood delivery time by shortening the path-length of the vessel contained within the slab (Parker et al., 1991), and because we are focused here on the pial vasculature, we have limited our considerations to a maximum blood delivery time of 1000 ms, with values of few hundreds of milliseconds being more likely."

      2) The main discussion of higher resolution leading to improvements rather than loss presented here seems a bit one-sided: for a more objective understanding of the differences it would be worth to explicitly derive the 'classical' treatment and show how it leads to different conclusions than the present one. In particular, the link made in the discussion between using relative magnetization and modeling partial voluming seems unclear, as both are unrelated. One could also argue that in theory higher resolution imaging is always better, but of course there are practical considerations in play: SNR, dynamics of the measured effect vs speed of acquisition, motion, etc. These issues are not really integrated into the model, even though they provide strong constraints on what can be done. It would be good to at least discuss the constraints that 140 or 160 microns resolution imposes on what is achievable at present.

      Thank you for this excellent suggestion. We found it instructive to illustrate the different effects separately, i.e. relative vs. absolute FRE, and then partial volume vs. no-partial volume effects. In response to comment R2.8 of Reviewer 2, we also clarified the derivation of the relative FRE vs the ‘classical’ absolute FRE (please see R2.8). Accordingly, the manuscript now includes the theoretical derivation in the Theory section and an explicit demonstration of how the classical treatment leads to different conclusions in the Supplementary Material. The important insight gained in our work is that only when considering relative FRE and partial-volume effects together, can we conclude that smaller voxels are advantageous. We have added the following section in the Supplementary Material:

      "Effect of FRE Definition and Interaction with Partial-Volume Model

      For the definition of the FRE effect employed in this study, we used a measure of relative FRE (Al-Kwifi et al., 2002) in combination with a partial-volume model (Eq. 6). To illustrate the implications of these two effects, as well as their interaction, we have estimated the relative and absolute FRE for an artery with a diameter of 200 µm or 2 000 µm (i.e. no partial-volume effects at the centre of the vessel). The absolute FRE expression explicitly takes the voxel volume into account, and so instead of Eq. (6) for the relative FRE we used"

      Eq. (1)

      "Note that the division by M_zS^tissue⋅l_voxel^3 to obtain the relative FRE from this expression removes the contribution of the total voxel volume (l_voxel^3). Supplementary Figure 2 shows that, when partial volume effects are present, the highest relative FRE arises in voxels with the same size as or smaller than the vessel diameter (Supplementary Figure 2A), whereas the absolute FRE increases with voxel size (Supplementary Figure 2C). If no partial-volume effects are present, the relative FRE becomes independent of voxel size (Supplementary Figure 2B), whereas the absolute FRE increases with voxel size (Supplementary Figure 2D). While the partial-volume effects for the relative FRE are substantial, they are much more subtle when using the absolute FRE and do not alter the overall characteristics."

      Supplementary Figure 2: Effect of voxel size and blood delivery time on the flow-related enhancement (FRE) using either a relative (A,B) (Eq. (3)) or an absolute (C,D) (Eq. (12)) FRE definition, assuming a pial artery diameter of 200 μm (A,C) or 2 000 µm (B,D), i.e. no partial-volume effects at the central voxel of this artery considered here.
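      The interaction between the FRE definition and partial voluming can be illustrated with a deliberately simple geometry, sketched below. This geometry is our assumption, not the manuscript's Eq. 6. A straight cylindrical artery of diameter d crosses a cubic voxel of edge l along one axis, so blood fills a fraction π d²/(4 l²) of the voxel, capped at 1. Relative FRE is the mixed-voxel contrast normalized by tissue magnetization. Absolute FRE is the same difference scaled by the voxel volume l³, so dividing it by M_tissue · l³ recovers the relative FRE. The magnetization values are hypothetical.

```python
import math


def blood_fraction(d_mm, l_mm):
    """Volume fraction of a cubic voxel (edge l) filled by a straight cylindrical
    artery (diameter d) crossing it along one axis; capped at 1 (voxel all lumen)."""
    return min(math.pi * d_mm**2 / (4.0 * l_mm**2), 1.0)


def voxel_contrast(d_mm, l_mm, m_blood, m_tissue):
    """Partial-volume mixture minus tissue magnetization, per unit volume."""
    f = blood_fraction(d_mm, l_mm)
    return f * m_blood + (1.0 - f) * m_tissue - m_tissue


def fre_relative(d_mm, l_mm, m_blood, m_tissue):
    return voxel_contrast(d_mm, l_mm, m_blood, m_tissue) / m_tissue


def fre_absolute(d_mm, l_mm, m_blood, m_tissue):
    # Total magnetization difference in the voxel, hence the factor l^3.
    return voxel_contrast(d_mm, l_mm, m_blood, m_tissue) * l_mm**3


M_BLOOD, M_TISSUE = 0.35, 0.18  # hypothetical longitudinal magnetizations
for d in (0.2, 2.0):
    for l in (0.14, 0.3, 0.5, 0.8):
        print(f"d={d} mm, l={l} mm: relative={fre_relative(d, l, M_BLOOD, M_TISSUE):.3f}, "
              f"absolute={fre_absolute(d, l, M_BLOOD, M_TISSUE):.5f}")
```

      For the 200 µm artery, relative FRE falls and absolute FRE rises as l grows. For the 2 000 µm artery, the voxel is filled with lumen: relative FRE stays flat while absolute FRE still grows with l³. This qualitatively mirrors the four panels described in the caption.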

      In addition, we have also clarified the contribution of the two definitions and their interaction in the Discussion section. Following the suggestion of Reviewer 2, we have extended our interpretation of relative FRE. In brief, absolute FRE is closely related to the physical origin of the contrast, whereas relative FRE is much more concerned with the “segmentability” of a vessel (please see R2.8 for more details):

      "Extending classical FRE treatments to the pial vasculature

      There are several major modifications in our approach to this topic that might explain why, in contrast to predictions from classical FRE treatments, it is indeed possible to image pial arteries. For instance, the definition of vessel contrast or flow-related enhancement is often stated as an absolute difference between blood and tissue signal (Brown et al., 2014a; Carr and Carroll, 2012; Du et al., 1993, 1996; Haacke et al., 1990; Venkatesan and Haacke, 1997). Here, however, we follow the approach of Al-Kwifi et al. (2002) and consider relative contrast. While this distinction may seem to be semantic, the effect of voxel volume on FRE for these two definitions is exactly opposite: Du et al. (1996) concluded that larger voxel size increases the (absolute) vessel-background contrast, whereas here we predict an increase in relative FRE for small arteries with decreasing voxel size. Therefore, predictions of the depiction of small arteries with decreasing voxel size differ depending on whether one is considering absolute contrast, i.e. difference in longitudinal magnetization, or relative contrast, i.e. contrast differences independent of total voxel size. Importantly, this prediction changes for large arteries where the voxel contains only vessel lumen, in which case the relative FRE remains constant across voxel sizes, but the absolute FRE increases with voxel size (Supplementary Figure 2). Overall, the interpretations of relative and absolute FRE differ, and one measure may be more appropriate for certain applications than the other. Absolute FRE describes the difference in magnetization and is thus tightly linked to the underlying physical mechanism. Relative FRE, however, describes the image contrast and segmentability. If blood and tissue magnetization are equal, both contrast measures would equal zero and indicate that no contrast difference is present. 
However, when there is signal in the vessel and as the tissue magnetization approaches zero, the absolute FRE approaches the blood magnetization (assuming no partial-volume effects), whereas the relative FRE approaches infinity. While this infinite relative FRE does not directly relate to the underlying physical process of ‘infinite’ signal enhancement through inflowing blood, it instead characterizes the segmentability of the image in that an image with zero intensity in the background and non-zero values in the structures of interest can be segmented perfectly and trivially. Accordingly, numerous empirical observations (Al-Kwifi et al., 2002; Bouvy et al., 2014; Haacke et al., 1990; Ladd, 2007; Mattern et al., 2018; von Morze et al., 2007) and the data provided here (Figure 5, 6 and 7) have shown the benefit of smaller voxel sizes if the aim is to visualize and segment small arteries."

      Note that our formulation of the FRE—even without considering SNR—does not suggest that higher resolution is always better, but instead should be matched to the size of the target arteries:

      "Importantly, note that our treatment of the FRE does not suggest that an arbitrarily small voxel size is needed, but instead that voxel sizes appropriate for the arterial diameter of interest are beneficial (in line with the classic “matched-filter” rationale (North, 1963)). Voxels smaller than the arterial diameter would not yield substantial benefits (Figure 5) and may result in SNR reductions that would hinder segmentation performance."

      Further, we have also extended the concluding paragraph of the Imaging limitation section to also include a practical perspective:

      "In summary, numerous theoretical and practical considerations remain for optimal imaging of pial arteries using time-of-flight contrast. Depending on the application, advanced displacement artefact compensation strategies may be required, and zero-filling could provide better vessel depiction. Further, an optimal trade-off between SNR, voxel size and acquisition time needs to be found. Currently, the partial-volume FRE model only considers voxel size, and, as we reduced the voxel size in the experiments, we (partially) compensated the reduction in SNR through longer scan times. This, ultimately, also required the use of prospective motion correction to enable the very long acquisition times necessary for a 140 µm isotropic voxel size. Often, anisotropic voxels are used to reduce acquisition time and increase SNR while maintaining in-plane resolution. This may indeed prove advantageous when the (also highly anisotropic) arteries align with the anisotropic acquisition, e.g. when imaging the large supplying arteries oriented mostly in the head-foot direction. In the case of pial arteries, however, there is no preferred orientation because of the convoluted nature of the pial arterial vasculature encapsulating the complex folding of the cortex (see section Anatomical architecture of the pial arterial vasculature). A further reduction in voxel size may be possible in dedicated research settings utilizing even longer acquisition times and/or larger acquisition volumes to maintain SNR. However, if acquisition time is limited, voxel size and SNR need to be carefully balanced against each other."

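      The trade-off between SNR, voxel size and acquisition time can be quantified with the common approximation that SNR scales with voxel volume times the square root of acquisition time, all else being equal. This scaling is a general rule of thumb, not a relation stated in the manuscript. Under it, going from 0.16 mm to 0.14 mm isotropic voxels keeps about 0.67 of the SNR at fixed scan time. Restoring the original SNR would take roughly 2.23 times as long.

```python
def snr_ratio_at_fixed_time(l_old_mm, l_new_mm):
    """SNR of the new isotropic voxel relative to the old one at equal scan time,
    assuming SNR is proportional to voxel volume."""
    return (l_new_mm / l_old_mm) ** 3


def time_factor_for_equal_snr(l_old_mm, l_new_mm):
    """Scan-time multiplier that restores the old SNR, assuming SNR ~ sqrt(time)."""
    return (1.0 / snr_ratio_at_fixed_time(l_old_mm, l_new_mm)) ** 2


print(round(snr_ratio_at_fixed_time(0.16, 0.14), 3))   # fraction of SNR kept
print(round(time_factor_for_equal_snr(0.16, 0.14), 2))  # scan-time multiplier
```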
      3) The article seems to imply that TOF-MRA is the only adequate technique to image brain vasculature, while T2* mapping, UHF T1 mapping (see e.g. Choi et al., https://doi.org/10.1016/j.neuroimage.2020.117259), phase (e.g. Fan et al., doi:10.1038/jcbfm.2014.187), QSM (see e.g. Huck et al., https://doi.org/10.1007/s00429-019-01919-4), or a combination (Bernier et al., https://doi.org/10.1002/hbm.24337, Ward et al., https://doi.org/10.1016/j.neuroimage.2017.10.049) all depict some level of vascular detail. It would be worth briefly reviewing the different effects of blood on MRI contrast and how those have been used in different approaches to measure vasculature. This would in particular help clarify the experiment combining TOF with T2* mapping used to separate arteries from veins (more on this question below).

      We apologize if we inadvertently created the impression that TOF-MRA is a suitable technique to image the complete brain vasculature, and we agree that susceptibility-based methods are much more suitable for venous structures. As outlined above, we have revised the manuscript in various sections to indicate that it is the pial arterial vasculature we are targeting. We have added a statement on imaging the venous vasculature in the Discussion section. Please see our response below regarding the use of T2* to separate arteries and veins.

      "The advantages of imaging the pial arterial vasculature using TOF-MRA without an exogenous contrast agent lie in its non-invasiveness and the potential to combine these data with various other structural and functional image contrasts provided by MRI. One common application is to acquire a velocity-encoded contrast such as phase-contrast MRA (Arts et al., 2021; Bouvy et al., 2016). Another interesting approach, which simultaneously acquires vascular and structural data, utilises the inherent time-of-flight contrast in magnetization-prepared two rapid acquisition gradient echo (MP2RAGE) images acquired at ultra-high field, albeit at lower achievable resolution and lower FRE compared to the TOF-MRA data in our study (Choi et al., 2020). In summary, we expect high-resolution TOF-MRA to be applicable also for group studies to address numerous questions regarding the relationship of arterial topology and morphometry to the anatomical and functional organization of the brain, and the influence of arterial topology and morphometry on brain hemodynamics in humans. In addition, imaging of the pial venous vasculature, using susceptibility-based contrasts such as T2*-weighted magnitude (Gulban et al., 2021) or phase imaging (Fan et al., 2015), susceptibility-weighted imaging (SWI) (Eckstein et al., 2021; Reichenbach et al., 1997) or quantitative susceptibility mapping (QSM) (Bernier et al., 2018; Huck et al., 2019; Mattern et al., 2019; Ward et al., 2018), would enable a comprehensive assessment of the complete cortical vasculature and how both arteries and veins shape brain hemodynamics."

      4) The results, while very impressive, are mostly qualitative. This seems a missed opportunity to strengthen the points of the paper: given the segmentations already made, the amount/density of detected vessels could be compared across scans for the data of Fig. 5 and 7. The minimum distance between vessels could be measured in Fig. 8 to show a 2D distribution and/or a spatial map of the displacement. The number of vessels labeled as veins instead of arteries in Fig. 9 could be given.

      We fully agree that estimating these quantitative measures would be very interesting; however, this would require the development of a comprehensive analysis framework, which would considerably shift the focus of this paper from data acquisition and flow-related enhancement to data analysis. As noted in the discussion section Challenges for vessel segmentation algorithms, ‘The vessel segmentations presented here were performed to illustrate the sensitivity of the image acquisition to small pial arteries’, because the smallest arteries tend to be concealed in the maximum intensity projections. Further, the interpretation of these measures is not straightforward. For example, the number of detected vessels for the artery depicted in Figure 5 does not change across resolutions, but their length does. We have therefore estimated the relative increase in skeleton length across resolutions for Figures 5 and 7. However, these estimates are not only a function of the voxel size but also of the underlying vasculature, i.e. the number of arteries with a certain diameter present, and may thus not generalise well to enable quantitative predictions of the improvement expected from increased resolutions. We have added an illustration of these analyses in the Supplementary Material, and the following additions in the Methods, Results and Discussion sections.

      "For vessel segmentation, a semi-automatic segmentation pipeline was implemented in Matlab R2020a (The MathWorks, Natick, MA) using the UniQC toolbox (Frässle et al., 2021): First, a brain mask was created through thresholding, which was then manually corrected in ITK-SNAP (http://www.itksnap.org/) (Yushkevich et al., 2006) such that pial vessels were included. For the high-resolution TOF data (Figures 6 and 7, Supplementary Figure 4), denoising to remove high-frequency noise was performed using the implementation of an adaptive non-local means denoising algorithm (Manjón et al., 2010) provided in DenoiseImage within the ANTs toolbox, with the search radius for the denoising set to 5 voxels and noise type set to Rician. Next, the brain mask was applied to the bias-corrected and denoised data (if applicable). Then, a vessel mask was created based on a manually defined threshold, and clusters with fewer than 10 or 5 voxels for the high- and low-resolution acquisitions, respectively, were removed from the vessel mask. Finally, an iterative region-growing procedure starting at each voxel of the initial vessel mask was applied that successively included additional voxels into the vessel mask if they were connected to a voxel which was already included and above a manually defined threshold (which was slightly lower than the previous threshold). Both thresholds were applied globally but manually adjusted for each slab. No correction for motion between slabs was applied. The Matlab code describing the segmentation algorithm as well as the analysis of the two-echo TOF acquisition outlined in the following paragraph are also included in our GitLab repository (https://gitlab.com/SaskiaB/pialvesseltof.git). To assess the data quality, maximum intensity projections (MIPs) were created and the outlines of the segmentation MIPs were added as an overlay. 
To estimate the increased detection of vessels with higher resolutions, we computed the relative increase in the length of the segmented vessels for the data presented in Figure 5 (0.8 mm, 0.5 mm, 0.4 mm and 0.3 mm isotropic voxel size) and Figure 7 (0.16 mm and 0.14 mm isotropic voxel size) by computing the skeleton using the bwskel Matlab function and then calculating the skeleton length as the number of voxels in the skeleton multiplied by the voxel size."
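      The thresholding, cluster-removal and region-growing steps described above can be sketched in Python/NumPy as follows. This is a minimal illustration, not the authors' Matlab/UniQC code, which is in their repository. The 6-connected neighbourhood, the function names and the toy volume are our assumptions. Denoising, bias correction and skeletonization (`bwskel` in Matlab) are omitted.

```python
from collections import deque

import numpy as np

# 6-connected neighbourhood: an assumption, the original connectivity is not stated.
OFFSETS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def neighbours(idx, shape):
    """Yield in-bounds 6-connected neighbours of a 3D voxel index."""
    for off in OFFSETS:
        n = tuple(int(i) + o for i, o in zip(idx, off))
        if all(0 <= c < s for c, s in zip(n, shape)):
            yield n


def remove_small_clusters(mask, min_size):
    """Keep only connected clusters with at least min_size voxels."""
    keep = np.zeros(mask.shape, dtype=bool)
    seen = np.zeros(mask.shape, dtype=bool)
    for start in zip(*np.nonzero(mask)):
        start = tuple(int(c) for c in start)
        if seen[start]:
            continue
        seen[start] = True
        cluster, queue = [], deque([start])
        while queue:  # breadth-first flood fill of one cluster
            v = queue.popleft()
            cluster.append(v)
            for n in neighbours(v, mask.shape):
                if mask[n] and not seen[n]:
                    seen[n] = True
                    queue.append(n)
        if len(cluster) >= min_size:
            for v in cluster:
                keep[v] = True
    return keep


def grow_region(seed_mask, image, low_threshold):
    """Iteratively add voxels above low_threshold that touch the growing mask."""
    grown = seed_mask.copy()
    queue = deque(tuple(int(c) for c in v) for v in zip(*np.nonzero(seed_mask)))
    while queue:
        v = queue.popleft()
        for n in neighbours(v, image.shape):
            if not grown[n] and image[n] > low_threshold:
                grown[n] = True
                queue.append(n)
    return grown


def segment_vessels(image, brain_mask, high_threshold, low_threshold, min_cluster):
    """Brain-masked threshold -> small-cluster removal -> region growing."""
    masked = np.where(brain_mask, image, 0.0)
    seeds = remove_small_clusters(masked > high_threshold, min_cluster)
    return grow_region(seeds, masked, low_threshold)


# Toy volume: a vessel along x with one dimmer voxel, plus an isolated bright voxel.
image = np.zeros((5, 5, 5))
image[:, 2, 2] = [1.0, 1.0, 0.6, 1.0, 1.0]
image[4, 4, 4] = 1.0
image[4, 4, 3] = 0.6
vessels = segment_vessels(image, np.ones(image.shape, dtype=bool), 0.8, 0.5, min_cluster=2)
print(int(vessels.sum()))
```

      The isolated bright voxel is discarded as a too-small cluster, and its dim neighbour is never reached. The dim voxel inside the vessel is recovered by the lower region-growing threshold.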

      "To investigate the effect of voxel size on vessel FRE, we acquired data at four different voxel sizes ranging from 0.8 mm to 0.3 mm isotropic resolution, adjusting only the encoding matrix, with imaging parameters being otherwise identical (FOV, TR, TE, flip angle, R, slab thickness, see section Data acquisition). The total acquisition time increases from less than 2 minutes for the lowest resolution scan to over 6 minutes for the highest resolution scan as a result. Figure 5 shows thin maximum intensity projections of a small vessel. While the vessel is not detectable at the largest voxel size, it slowly emerges as the voxel size decreases and approaches the vessel size. Presumably, this is driven by the considerable increase in FRE as seen in the single slice view (Figure 5, small inserts). Accordingly, the FRE computed from the vessel mask for the smallest part of the vessel (Figure 5, red mask) increases substantially with decreasing voxel size. More precisely, reducing the voxel size from 0.8 mm, 0.5 mm or 0.4 mm to 0.3 mm increases the FRE by 2900 %, 165 % and 85 %, respectively. Assuming a vessel diameter of 300 μm, the partial-volume FRE model (section Introducing a partial-volume model) would predict similar ratios of 611%, 178% and 78%. However, as long as the vessel is larger than the voxel (Figure 5, blue mask), the relative FRE does not change with resolution (see also Effect of FRE Definition and Interaction with Partial-Volume Model in the Supplementary Material). To illustrate the gain in sensitivity to detect smaller arteries, we have estimated the relative increase of the total length of the segmented vasculature (Supplementary Figure 9): reducing the voxel size from 0.8 mm to 0.5 mm isotropic increases the skeleton length by 44 %, reducing the voxel size from 0.5 mm to 0.4 mm isotropic increases the skeleton length by 28 %, and reducing the voxel size from 0.4 mm to 0.3 mm isotropic increases the skeleton length by 31 %. 
In summary, when imaging small pial arteries, these data support the hypothesis that it is primarily the voxel size, not the blood delivery time, which determines whether vessels can be resolved."
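      The model predictions quoted above (611 %, 178 % and 78 %) follow from one simple assumption, checked in the short sketch below. While the vessel is no larger than the voxel, relative FRE is taken to scale with the blood volume fraction, and hence with 1/l². The sketch applies exactly that scaling and does not attempt to reproduce the measured gains. The function name is ours.

```python
def predicted_relative_fre_gain_percent(l_from_mm, l_to_mm, d_mm):
    """Percent change in relative FRE when the isotropic voxel edge shrinks from
    l_from to l_to, assuming relative FRE is proportional to the blood volume
    fraction ~ d^2 / l^2 (valid only while the vessel is no larger than the voxel)."""
    if d_mm > l_to_mm:
        raise ValueError("vessel larger than the target voxel; 1/l^2 scaling does not apply")
    return 100.0 * ((l_from_mm / l_to_mm) ** 2 - 1.0)


for l_from in (0.8, 0.5, 0.4):
    gain = predicted_relative_fre_gain_percent(l_from, 0.3, d_mm=0.3)
    print(f"{l_from} mm -> 0.3 mm: +{gain:.0f} %")
```

      This prints gains of +611 %, +178 % and +78 %.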

      "Indeed, the reduction in voxel volume by 33 % revealed additional small branches connected to larger arteries (see also Supplementary Figure 8). For this example, we found an overall increase in skeleton length of 14 % (see also Supplementary Figure 9)."

      "We therefore expect this strategy to enable an efficient image acquisition without the need for additional venous suppression RF pulses. Once these challenges for vessel segmentation algorithms are addressed, a thorough quantification of the arterial vasculature can be performed. For example, the skeletonization procedure used to estimate the increase of the total length of the segmented vasculature (Supplementary Figure 9) exhibits errors particularly in the unwanted sinuses and large veins. While they are consistently present across voxel sizes, and thus may have less impact on relative change in skeleton length, they need to be addressed when estimating the absolute length of the vasculature, or other higher-order features such as number of new branches. (Note that we have also performed the skeletonization procedure on the maximum intensity projections to reduce the number of artefacts and obtained comparable results: reducing the voxel size from 0.8 mm to 0.5 mm isotropic increases the skeleton length by 44 % (3D) vs 37 % (2D), reducing the voxel size from 0.5 mm to 0.4 mm isotropic increases the skeleton length by 28 % (3D) vs 26 % (2D), reducing the voxel size from 0.4 mm to 0.3 mm isotropic increases the skeleton length by 31 % (3D) vs 16 % (2D), and reducing the voxel size from 0.16 mm to 0.14 mm isotropic increases the skeleton length by 14 % (3D) vs 24 % (2D).)"

      Supplementary Figure 9: Increase of vessel skeleton length with voxel size reduction. Axial maximum intensity projections for data acquired with different voxel sizes ranging from 0.8 mm to 0.3 mm (TOP) (corresponding to Figure 5) and 0.16 mm to 0.14 mm isotropic (corresponding to Figure 7) are shown. Vessel skeletons derived from segmentations performed for each resolution are overlaid in red. A reduction in voxel size is accompanied by a corresponding increase in vessel skeleton length.

      Regarding further quantification of the vessel displacement presented in Figure 8, we have estimated the displacement using the Horn-Schunck optical flow estimator (Horn and Schunck, 1981; Mustafa, 2016) (https://github.com/Mustafa3946/Horn-Schunck-3D-Optical-Flow). However, the results are dominated by the larger arteries, whereas we are mostly interested in the displacement of the smallest arteries; this quantification may therefore not be helpful.

      Because the theoretical relationship between vessel displacement and blood velocity is well known (Eq. 7), and because we have outlined the expected blood velocity as a function of arterial diameter in Figure 2, which provided displacement estimates that matched those found in our data (as reported in our original submission), we believe that the new quantification in this form does not add value to the manuscript. It would, however, be interesting to explore the use of this displacement artefact as a measure of blood velocity. This would require more substantial analyses, in particular for the estimation of the arterial diameter, as well as additional validation data (e.g. phase-contrast MRA). We have outlined this avenue in the Discussion section. What is relevant to the main aim of this study, namely imaging of small pial arteries, is the insight that blood velocities are indeed sufficiently fast to cause displacement artefacts even in smaller arteries. We have clarified this in the Results section:

      "Note that correction techniques exist to remove displaced vessels from the image (Gulban et al., 2021), but they cannot revert the vessels to their original location. Alternatively, this artefact could also potentially be utilised as a rough measure of blood velocity."

      "At a delay time of 10 ms between phase encoding and echo time, the observed displacement of approximately 2 mm in some of the larger vessels would correspond to a blood velocity of 200 mm/s, which is well within the expected range (Figure 2). For the smallest arteries, a displacement of one voxel (0.4 mm) can be observed, indicative of blood velocities of 40 mm/s. Note that the vessel displacement can be observed in all vessels visible at this resolution, indicating high blood velocities throughout much of the pial arterial vasculature. Thus, assuming a blood velocity of 40 mm/s (Figure 2) and a delay time of 5 ms for the high-resolution acquisitions (Figure 6), vessel displacements of 0.2 mm are possible, representing a shift of 1–2 voxels."
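      The velocity and displacement figures above follow from displacement = velocity × delay time. The sketch below assumes this simple linear relation (Eq. 7 in the manuscript should be consulted for the exact form), and its function names are hypothetical:

```python
def velocity_from_displacement(displacement_mm: float, delay_ms: float) -> float:
    """Blood velocity (mm/s) implied by a displacement over the given delay."""
    return displacement_mm / (delay_ms / 1000.0)


def displacement_from_velocity(velocity_mm_s: float, delay_ms: float) -> float:
    """Displacement (mm) of flowing spins over the given delay."""
    return velocity_mm_s * delay_ms / 1000.0


print(velocity_from_displacement(2.0, 10.0))  # larger vessels, 10 ms delay
print(velocity_from_displacement(0.4, 10.0))  # one 0.4 mm voxel, 10 ms delay
shift_mm = displacement_from_velocity(40.0, 5.0)
for voxel_mm in (0.16, 0.14):
    print(f"{shift_mm:.2f} mm = {shift_mm / voxel_mm:.2f} voxels at {voxel_mm} mm")
```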

      Regarding the number of vessels labelled as veins, please see our response below to R1.5.

      In the main quantification given, the estimation of FRE increase with resolution, it would make more sense to perform the segmentation independently for each scan and estimate the corresponding FRE: using the mask from the highest resolution scan only biases the results. It is unclear also if the background tissue measurement one voxel outside took partial voluming into account (by leaving a one voxel free interface between vessel and background). In this analysis, it would also be interesting to estimate SNR, so you can compare SNR and FRE across resolutions, also helpful for the discussion on SNR.

      The FRE serves as an indicator of the potential performance of any segmentation algorithm (including manual segmentation) (also see our discussion on the interpretation of FRE in our response to R1.2). If we were to segment each scan individually, we would, in the ideal case, always obtain the same FRE estimate, as FRE influences the performance of the segmentation algorithm. In practice, this simply means that it is not possible to segment the vessel in the low-resolution image to its full extent that is visible in the high-resolution image, because the FRE is too low for small vessels. However, we agree with the core point that the reviewer is making, and so to help address this, a valuable addition would be to compare the FRE for the section of a vessel that is visible at all resolutions, where we found—within the accuracy of the transformations and resampling across such vastly different resolutions—that the FRE does not increase any further with higher resolution if the vessel is larger than the voxel size (page 18 and Figure 5). As stated in the Methods section, and as noted by the reviewer, we used the voxels immediately next to the vessel mask to define the background tissue signal level. Any resulting potential partial-volume effects in these background voxels would affect all voxel sizes, introducing a consistent bias that would not impact our comparison. However, inspection of the image data in Figure 5 showed partial-volume effects predominantly within those voxels intersecting the vessel, rather than voxels surrounding the vessel, in agreement with our model of FRE.

      "All imaging data were slab-wise bias-field corrected using the N4BiasFieldCorrection (Tustison et al., 2010) tool in ANTs (Avants et al., 2009) with the default parameters. To compare the empirical FRE across the four different resolutions (Figure 5), manual masks were first created for the smallest part of the vessel in the image with the highest resolution and for the largest part of the vessel in the image with the lowest resolution. Then, rigid-body transformation parameters from the low-resolution to the high-resolution (and the high-resolution to the low-resolution) images were estimated using coregister in SPM (https://www.fil.ion.ucl.ac.uk/spm/), and their inverse was applied to the vessel mask using SPM’s reslice. To calculate the empirical FRE (Eq. (3)), the mean of the intensity values within the vessel mask was used to approximate the blood magnetization, and the mean of the intensity values one voxel outside of the vessel mask was used as the tissue magnetization."
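      The empirical FRE computation described above can be sketched as follows. This is a NumPy illustration, not the authors' SPM/Matlab pipeline. It assumes the relative definition FRE = (S_vessel − S_tissue) / S_tissue for Eq. (3), and it takes the "one voxel outside" tissue region to be a face-adjacent (6-connected in 3D), one-voxel ring around the vessel mask. Both choices and the helper names are assumptions.

```python
import numpy as np


def one_voxel_ring(mask):
    """Voxels face-adjacent to the mask (6-connectivity in 3D) but not part of it."""
    mask = np.asarray(mask, dtype=bool)
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    core = (slice(1, -1),) * mask.ndim
    dilated = padded[core].copy()
    for axis in range(mask.ndim):
        for shift in (-1, 1):
            # The False padding ensures np.roll's wrap-around only affects the discarded border.
            dilated |= np.roll(padded, shift, axis=axis)[core]
    return dilated & ~mask


def empirical_fre(image, vessel_mask):
    """Relative FRE: (mean vessel signal - mean ring signal) / mean ring signal."""
    vessel = np.asarray(vessel_mask, dtype=bool)
    tissue = one_voxel_ring(vessel)
    s_blood = image[vessel].mean()
    s_tissue = image[tissue].mean()
    return (s_blood - s_tissue) / s_tissue


# Synthetic check: a 5-voxel vessel segment at 300 a.u. in tissue at 100 a.u.
img = np.full((9, 9, 9), 100.0)
mask = np.zeros((9, 9, 9), dtype=bool)
mask[4, 4, 2:7] = True
img[mask] = 300.0
print(empirical_fre(img, mask))
```

      Note that if partial-volume voxels at the vessel boundary are included in the ring, the tissue estimate is biased upwards. As argued above, this would affect all voxel sizes in a similar way.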

      "To investigate the effect of voxel size on vessel FRE, we acquired data at four different voxel sizes ranging from 0.8 mm to 0.3 mm isotropic resolution, adjusting only the encoding matrix, with imaging parameters being otherwise identical (FOV, TR, TE, flip angle, R, slab thickness, see section Data acquisition). As a result, the total acquisition time increases from less than 2 minutes for the lowest-resolution scan to over 6 minutes for the highest-resolution scan. Figure 5 shows thin maximum intensity projections of a small vessel. While the vessel is not detectable at the largest voxel size, it gradually emerges as the voxel size decreases and approaches the vessel size. Presumably, this is driven by the considerable increase in FRE, as seen in the single-slice view (Figure 5, small inserts). Accordingly, the FRE computed from the vessel mask for the smallest part of the vessel (Figure 5, red mask) increases substantially with decreasing voxel size. More precisely, reducing the voxel size from 0.8 mm, 0.5 mm or 0.4 mm to 0.3 mm increases the FRE by 2900 %, 165 % and 85 %, respectively. Assuming a vessel diameter of 300 μm, the partial-volume FRE model (section Introducing a partial-volume model) would predict similar ratios of 611 %, 178 % and 78 %. However, if the vessel is larger than the voxel (Figure 5, blue mask), the relative FRE remains constant across resolutions (see also Effect of FRE Definition and Interaction with Partial-Volume Model in the Supplementary Material). To illustrate the gain in sensitivity to smaller arteries, we have estimated the relative increase of the total length of the segmented vasculature (Supplementary Figure 9): reducing the voxel size from 0.8 mm to 0.5 mm isotropic increases the skeleton length by 44 %, reducing the voxel size from 0.5 mm to 0.4 mm isotropic increases the skeleton length by 28 %, and reducing the voxel size from 0.4 mm to 0.3 mm isotropic increases the skeleton length by 31 %. 
In summary, when imaging small pial arteries, these data support the hypothesis that it is primarily the voxel size, not blood delivery time, which determines whether vessels can be resolved."

      Figure 5: Effect of voxel size on flow-related vessel enhancement. Thin axial maximum intensity projections containing a small artery acquired with different voxel sizes ranging from 0.8 mm to 0.3 mm isotropic are shown. The FRE is estimated using the mean intensity value within the vessel masks depicted on the left and the mean intensity value of the surrounding tissue. The small insert shows a section of the artery as it lies within a single slice. A reduction in voxel size is accompanied by a corresponding increase in FRE (red mask), whereas no further increase is obtained once the voxel size is equal to or smaller than the vessel size (blue mask).

      After many internal discussions, we had to conclude that deriving a meaningful SNR analysis that would benefit the reader was not possible with the available data, due to the complex relationship between voxel size and other imaging parameters in practice. Specifically, we reduced the voxel size but at the same time increased the acquisition time by increasing the number of encoding steps, which we have now also highlighted in the manuscript. We have, however, added further considerations about balancing SNR and segmentation performance. Note that these considerations are not specific to imaging the pial arteries but apply to all MRA acquisitions, and have thus been discussed previously in the literature. Here, we wanted to focus on the novel insights gained in our study. Importantly, while we previously noted that reducing the voxel size improves contrast in vessels whose diameters are smaller than the voxel size, we now explicitly acknowledge that, for vessels whose diameters are larger than the voxel size, reducing the voxel size is not helpful (since it only reduces SNR without any gain in contrast) and may hinder segmentation performance, thus becoming counterproductive.

      "In general, we have not considered SNR, but only FRE, i.e. the (relative) image contrast, assuming that segmentation algorithms would benefit from higher contrast for smaller arteries. Importantly, the acquisition parameters available to maximize FRE are limited, namely repetition time, flip angle and voxel size. SNR, however, can be improved via numerous avenues independent of these parameters (Brown et al., 2014b; Du et al., 1996; Heverhagen et al., 2008; Parker et al., 1991; Triantafyllou et al., 2011; Venkatesan and Haacke, 1997), the simplest being longer acquisition times. If the aim is to optimize a segmentation outcome for a given acquisition time, the trade-off between contrast and SNR for the specific segmentation algorithm needs to be determined (Klepaczko et al., 2016; Lesage et al., 2009; Moccia et al., 2018; Phellan and Forkert, 2017). Our own—albeit limited—experience has shown that segmentation algorithms (including manual segmentation) can accommodate a perhaps surprising amount of noise using prior knowledge and neighborhood information, making these high-resolution acquisitions possible. Importantly, note that our treatment of the FRE does not suggest that an arbitrarily small voxel size is needed, but instead that voxel sizes appropriate for the arterial diameter of interest are beneficial (in line with the classic “matched-filter” rationale (North, 1963)). Voxels smaller than the arterial diameter would not yield substantial benefits (Figure 5) and may result in SNR reductions that would hinder segmentation performance."

      5) The separation of arterial and venous components is a bit puzzling, partly because the methodology used is not fully explained, but also partly because the reasons invoked (flow artefact in large pial veins) do not match the results (many small vessels are included as veins). This question of separating both types of vessels is quite important for applications, so the whole procedure should be explained in detail. The use of short T2* also seemed sub-optimal, as both arteries and veins result in shorter T2* compared to most brain tissues: wouldn't a susceptibility-based measure (SWI or better QSM) provide a better separation? Finally, since the T2* map and the regular TOF map are at different resolutions, masking out the vessels labeled as veins will likely result in the smaller veins being left out.

      We agree that while the technical details of this approach were provided in the Data analysis section, the rationale behind it was only briefly mentioned. We have therefore included an additional section, Inflow-artefacts in sinuses and pial veins, in the Theory section of the manuscript. We have also extended the discussion of the advantages and disadvantages of the different susceptibility-based contrasts, namely T2*, SWI and QSM. While in theory both T2* and QSM should allow the reliable differentiation of arterial and venous blood, we found T2* to perform more robustly: QSM can fail in many places, e.g. near the strong susceptibility sources of the superior sagittal and transverse sinuses and the pial veins, and because of their proximity to the brain surface dedicated processing is required (Stewart et al., 2022). Further, we have also elaborated in the Discussion section on why the interpretation of Figure 9 regarding the absence or presence of small veins is challenging. Namely, the intensity-based segmentation used here provides only an incomplete segmentation even of the larger sinuses, because the overall lower intensity found in veins, combined with the heterogeneity of their intensities, violates the assumption made by most vascular segmentation approaches of homogeneous, high image intensities within vessels, which is satisfied in arteries (page 29f) (see also the illustration below). Accordingly, quantifying the number of vessels labelled as veins (R1.4a) would provide misleading results, as often only small subsets of the same sinus or vein are segmented.

      "Inflow-artefacts in sinuses and pial veins

      Inflow in large pial veins and the sagittal and transverse sinuses can cause flow-related enhancement in these non-arterial vessels. One common strategy to remove this unwanted signal enhancement is to apply venous suppression pulses during the data acquisition, which saturate blood spins outside the imaging slab. Disadvantages of this technique are the technical challenges of applying these pulses at ultra-high field due to specific absorption rate (SAR) constraints and the necessary increase in acquisition time (Conolly et al., 1988; Heverhagen et al., 2008; Johst et al., 2012; Maderwald et al., 2008; Schmitter et al., 2012; Zhang et al., 2015). In addition, optimal positioning of the saturation slab in the case of pial arteries requires further investigation; in particular, suppressing signal from the superior sagittal sinus without interfering with the imaging of the pial arterial vasculature at the top of the cortex might prove challenging. Furthermore, this venous saturation strategy is based on the assumption that arterial blood travels head-wards while venous blood drains foot-wards. For the complex and convoluted trajectory of pial vessels, this directionality-based saturation might be oversimplified, particularly when considering the higher-order branches of the pial arteries and veins on the cortical surface. Inspired by techniques to simultaneously acquire a TOF image for angiography and a susceptibility-weighted image for venography (Bae et al., 2010; Deistung et al., 2009; Du et al., 1994; Du and Jin, 2008), we set out to explore the possibility of removing unwanted venous structures from the segmentation of the pial arterial vasculature during data postprocessing. 
Because arteries filled with oxygenated blood have T2*-values similar to tissue, while veins have much shorter T2*-values due to the presence of deoxygenated blood (Pauling and Coryell, 1936; Peters et al., 2007; Uludağ et al., 2009; Zhao et al., 2007), we used this criterion to remove vessels with short T2* values from the segmentation (see Data Analysis for details). In addition, we also explored whether unwanted venous structures in the high-resolution TOF images, where a two-echo acquisition is not feasible due to the longer readout, can be removed based on detecting them in a lower-resolution image."

      "Removal of pial veins

      Inflow in large pial veins and the superior sagittal and transverse sinuses can cause a flow-related enhancement in these non-arterial vessels (Figure 9, left). The higher concentration of deoxygenated haemoglobin in these vessels leads to shorter T2* values (Pauling and Coryell, 1936), which can be estimated using a two-echo TOF acquisition (see also Inflow-artefacts in sinuses and pial veins). These vessels can be identified in the segmentation based on their T2* values (Figure 9, left) and removed from the angiogram (Figure 9, right) (Bae et al., 2010; Deistung et al., 2009; Du et al., 1994; Du and Jin, 2008). In particular, the superior and inferior sagittal sinuses, the transverse sinuses and large veins, which exhibited an inhomogeneous intensity profile and a steep loss of intensity at the slab boundary, were identified as non-arterial (Figure 9, left). Further, we also explored the option of removing unwanted venous vessels from the high-resolution TOF image (Figure 7) using a low-resolution two-echo TOF (not shown). This indeed allowed us to remove the strong signal enhancement in the sagittal sinuses and numerous larger veins, although some small veins, which are characterised by inhomogeneous intensity profiles and can be detected visually by experienced raters, remain."
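      A voxel-wise T2* estimate from a two-echo acquisition, followed by thresholding within the vessel segmentation, can be sketched as below. It assumes mono-exponential decay between the two echoes, S(TE) = S0 · exp(−TE / T2*). The echo times, the threshold and the function names are hypothetical placeholders, not the values or code of the study, whose own Matlab implementation should be consulted for the exact procedure.

```python
import numpy as np


def t2star_two_echo(s1, s2, te1_ms, te2_ms, eps=1e-6):
    """Voxel-wise T2* (ms) from two echoes, assuming mono-exponential decay."""
    s1 = np.asarray(s1, dtype=float)
    s2 = np.asarray(s2, dtype=float)
    ratio = np.clip(s1, eps, None) / np.clip(s2, eps, None)
    with np.errstate(divide="ignore"):
        t2s = (te2_ms - te1_ms) / np.log(ratio)
    # No decay (or a signal increase) gives an undefined or negative T2*; mark it as infinite.
    return np.where(ratio > 1.0, t2s, np.inf)


def label_veins(vessel_mask, t2s_ms, threshold_ms):
    """Label segmented vessel voxels with short T2* as venous (threshold is hypothetical)."""
    return np.asarray(vessel_mask, dtype=bool) & (t2s_ms < threshold_ms)


# Hypothetical example: an "artery" voxel (T2* = 25 ms) and a "vein" voxel (T2* = 10 ms)
te1, te2 = 5.0, 15.0
s1 = np.array([100 * np.exp(-te1 / 25.0), 100 * np.exp(-te1 / 10.0)])
s2 = np.array([100 * np.exp(-te2 / 25.0), 100 * np.exp(-te2 / 10.0)])
t2s = t2star_two_echo(s1, s2, te1, te2)
print(t2s, label_veins(np.array([True, True]), t2s, threshold_ms=18.0))
```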

      Figure 9: Removal of non-arterial vessels in time-of-flight imaging. LEFT: Segmentation of arteries (red) and veins (blue) using T2* estimates. RIGHT: Time-of-flight angiogram after vein removal.

      Our approach also assumes that the unwanted veins are large enough that they are also resolved in the low-resolution image. If we consider the source of the FRE effect, it might indeed be exclusively large veins that are present in TOF-MRA data, which would suggest that our assumption is valid. Fundamentally, the FRE depends on the inflow of unsaturated spins into the imaging slab. However, small veins drain capillary beds in the local tissue, i.e. the tissue within the slab. (Note that due to the slice oversampling implemented in our acquisition, spins just above or below the slab will also be excited.) Thus, small veins only contain blood water spins that have experienced a large number of RF pulses due to the long transit time through the pial arterial vasculature, the capillaries and the intracortical venules. Hence, their longitudinal magnetization would be similar to that of stationary tissue. To generate an FRE effect in veins, "pass-through" venous blood from outside the imaging slab is required. This is only available in veins that pass through the imaging slab, which have much larger diameters. These theoretical considerations are corroborated by the findings in Figure 9, where large disconnected vessels with varying intensity profiles were identified as non-arterial. Due to the heterogeneous intensity profiles in large veins and the sagittal and transverse sinuses, the intensity-based segmentation applied here may only label a subset of the vessel lumen, creating the impression of many small veins. This is particularly the case for the straight and inferior sagittal sinus in the bottom slab of Figure 9. 
Nevertheless, future studies combining anatomical prior knowledge, advanced segmentation algorithms and susceptibility measures may be capable of removing these unwanted veins in post-processing, enabling an efficient TOF-MRA image acquisition dedicated to optimally detecting small arteries without the need for additional venous suppression RF pulses.
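      The argument that small veins carry nearly saturated spins can be illustrated with the standard spoiled gradient-echo recursion for longitudinal magnetization. After n RF pulses, M_z = M_ss + (M0 − M_ss)(E1 cos α)^n with E1 = exp(−TR/T1), which approaches the stationary-tissue steady state M_ss. The sketch assumes ideal spoiling and takes TR = 20 ms and an 18° flip angle as examples from the text. The blood T1 of 2100 ms and the transit times are illustrative assumptions.

```python
import math


def steady_state_mz(flip_deg, tr_ms, t1_ms, m0=1.0):
    """Steady-state longitudinal magnetization of stationary, ideally spoiled spins."""
    e1 = math.exp(-tr_ms / t1_ms)
    c = math.cos(math.radians(flip_deg))
    return m0 * (1.0 - e1) / (1.0 - e1 * c)


def mz_after_pulses(n, flip_deg, tr_ms, t1_ms, m0=1.0):
    """Longitudinal magnetization (just before the next pulse) of spins that entered
    fully relaxed and have since experienced n RF pulses."""
    e1 = math.exp(-tr_ms / t1_ms)
    c = math.cos(math.radians(flip_deg))
    m_ss = steady_state_mz(flip_deg, tr_ms, t1_ms, m0)
    return m_ss + (m0 - m_ss) * (e1 * c) ** n


TR_MS, FLIP_DEG, T1_BLOOD_MS = 20.0, 18.0, 2100.0  # T1 is an illustrative assumption
m_ss = steady_state_mz(FLIP_DEG, TR_MS, T1_BLOOD_MS)
for transit_ms in (100, 500, 2000, 5000):
    n = transit_ms / TR_MS
    excess = mz_after_pulses(n, FLIP_DEG, TR_MS, T1_BLOOD_MS) / m_ss - 1.0
    print(f"{transit_ms:5d} ms in slab -> {excess:6.1%} above stationary tissue")
```

      Spins that have spent a long time within the slab end up close to the tissue steady state, so they provide little flow-related enhancement. This is consistent with the expectation that only pass-through veins appear bright.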

      6) A more general question also is why this imaging method is limited to pial vessels: at 140 microns, the larger intra-cortical vessels should be appearing (group 6 in Duvernoy, 1981: diameters between 50 and 240 microns). Are there other reasons these vessels are not detected? Similarly, it seems there is no arterial vasculature detected in the white matter here: is it due to the rather superior location of the imaging slab, or a limitation of the method? Likewise, all three results focus on a rather homogeneous region of cerebral cortex, in terms of vascularisation. It would be interesting for applications to demonstrate the capabilities of the method in more complex regions, e.g. the densely vascularised cerebellum, or more heterogeneous regions like the midbrain. Finally, it is notable that all three subjects appear to have rather different densities of vessels, from sparse (participant II) to dense (participant I), with some inhomogeneities in density (frontal region in participant III) and inconsistencies in detection (sinuses absent in participant II). All these points should be discussed.

      While we are aware that the diameter of intracortical arteries has been suggested to be up to 240 µm (Duvernoy et al., 1981), it remains unclear how prevalent intracortical arteries of this size are. For example, note that in a different context in the Duvernoy study (in the revised manuscript), the following values are mentioned (which we followed in Figure 1):

      “Central arteries of the lobule always have a large diameter of 260 µ to 280 µ, at their origin. Peripheral arteries have an average diameter of 150 µ to 180 µ. At the cortex surface, all arterioles of 50 µ or less, penetrate the cortex or form anastomoses. The diameter of most of these penetrating arteries is approximately 40 µ.”

      Further, the examinations by Hirsch et al. (2012) (albeit in the macaque brain), showed one (exemplary) intracortical artery belonging to group 6 (Figure 1B), whose diameter appears to be below 100 µm. Given these discrepancies and the fact that intracortical arteries in group 5 only reach 75 µm, we suspect that intracortical arteries with diameters > 140 µm are a very rare occurrence, which we might not have encountered in this data set.

      Similarly, arteries in white matter (Nonaka et al., 2003) and the cerebellum (Duvernoy et al., 1983) are currently beyond our resolution. The midbrain is an interesting suggestion, although we believe that the cortical areas chosen here, with their gradual reduction in diameter along the vascular tree, provide a better illustration of the effect of voxel size than the rather abrupt reduction in vascular diameter found in the midbrain. We have added a note on these even higher resolution requirements to the Discussion section:

      "In summary, we expect high-resolution TOF-MRA to be applicable also for group studies, to address numerous questions regarding the relationship of arterial topology and morphometry to the anatomical and functional organization of the brain, and the influence of arterial topology and morphometry on brain hemodynamics in humans. Notably, we have focused on imaging pial arteries of the human cerebrum; however, other brain structures such as the cerebellum, subcortex and white matter are of course also of interest. While the same theoretical considerations apply, imaging the arterial vasculature in these structures will require even smaller voxel sizes due to their smaller arterial diameters (Duvernoy et al., 1983, 1981; Nonaka et al., 2003)."

      Regarding the apparent sparsity of results from Participant II, this is mostly driven by the much smaller coverage in this subject (19.6 mm in Participant II vs. 50 mm and 58 mm in Participants I and III, respectively). The reduction in density in the frontal regions might indeed constitute a difference in anatomy, or might be driven by the presence of more false-positive veins in Participant I than in Participant III in these areas. Following the depiction in Duvernoy et al. (1981), one would not expect large arteries in frontal areas, but large veins are common. Thus, the additional vessels in Participant I in the frontal areas might well be false-positive veins, and their removal would result in similar densities for both participants. Indeed, as pointed out in section Future directions, we would expect a lower arterial density in frontal and posterior areas than in middle areas. The sinuses (and other large false-positive veins) in Participant II have been removed as outlined and discussed in sections Removal of pial veins and Challenges for vessel segmentation algorithms, respectively.

      7) One of the main practical limitations of the proposed method is the use of a very small imaging slab. It is mentioned in the discussion that thicker slabs are not only possible, but beneficial both in terms of SNR and acceleration possibilities. What are the limitations that prevented their use in the present study? With the current approach, what would be the estimated time needed to acquire the vascular map of an entire brain? It would also be good to indicate whether specific processing was needed to stitch together the multiple slab images in Fig. 6-9, S2.

      Time-of-flight acquisitions are commonly performed with thin acquisition slabs, following initial investigations by Parker et al. (1991), to maximise vessel sensitivity and minimize noise. We therefore followed this practice for our initial investigations but wanted to point out in the Discussion that thicker slabs might provide several advantages that need to be evaluated in future studies. This would include theoretical and empirical evaluations balancing SNR gains from larger excitation volumes against SNR losses due to higher acceleration. For this study, we chose the slab thickness so as to keep the acquisition time reasonable and minimize motion artefacts (as outlined in the Discussion). In addition, due to the extreme matrix sizes, in particular for the 0.14 mm acquisition, we were also limited in the number of data points per image that can be indexed; lifting this limit would require even more substantial changes to the sequence than those we have already made. With 16 slabs, assuming optimal FOV orientation, full-brain coverage including the cerebellum for 95 % of the population (Mennes et al., 2014) could be achieved with an acquisition time of 3 h 7 min 12 s (16 × 11 min 42 s) at 0.16 mm isotropic voxel size. No stitching of the individual slabs was performed, as subject motion was minimal. We have added a corresponding comment to the Data Analysis section.
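      The full-brain acquisition time quoted above is simple arithmetic, checked here with Python's `datetime.timedelta` (16 slabs of 11 min 42 s each, at 0.16 mm isotropic):

```python
from datetime import timedelta

per_slab = timedelta(minutes=11, seconds=42)  # acquisition time per slab at 0.16 mm
n_slabs = 16
total = n_slabs * per_slab
print(total)  # 3:07:12
```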

      "Both thresholds were applied globally but manually adjusted for each slab. No correction for motion between slabs was applied, as subject motion was minimal. The Matlab code describing the segmentation algorithm, as well as the analysis of the two-echo TOF acquisition outlined in the following paragraph, is also included in the repository (https://gitlab.com/SaskiaB/pialvesseltof.git)."

      8) Some researchers and clinicians will argue that you can attain best results with anisotropic voxels, combining higher SNR and higher resolution. It would be good to briefly mention why isotropic voxels are preferred here, and whether anisotropic voxels would make sense at all in this context.

      Anisotropic voxels can be advantageous if the underlying object is anisotropic, e.g. an artery running straight through the slab, which would have a certain diameter (imaged using the high-resolution plane) and an ‘infinite’ elongation (in the low-resolution direction). However, the vessels targeted here can have any orientation and curvature; an anisotropic acquisition could therefore introduce a bias favouring vessels with a particular orientation relative to the voxel grid. Note that the same argument applies when answering the question of why a further reduction in slab thickness would eventually result in a smaller increase in FRE (section Introducing a partial-volume model). We have added a corresponding comment to our discussion on practical imaging considerations:
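      The orientation bias can be made concrete with a cylinder-in-a-voxel simplification. For a straight vessel running along one axis of a box-shaped voxel, the blood volume fraction depends only on the voxel cross-section perpendicular to the vessel. The sketch uses a hypothetical 0.14 × 0.14 × 0.5 mm³ anisotropic voxel and a hypothetical 0.1 mm vessel, compared against 0.14 mm isotropic voxels; the geometry and the helper name are assumptions.

```python
import math


def blood_fraction_along_axis(diameter_mm, voxel_mm, axis):
    """Blood volume fraction of a straight cylindrical vessel running along `axis`
    (0, 1 or 2) through a box-shaped voxel with edge lengths `voxel_mm`."""
    cross = [edge for i, edge in enumerate(voxel_mm) if i != axis]
    if diameter_mm > min(cross):
        raise ValueError("vessel is wider than the voxel cross-section")
    # Cylinder volume divided by voxel volume; the edge length along `axis` cancels.
    return (math.pi * diameter_mm ** 2 / 4.0) / (cross[0] * cross[1])


VESSEL_MM = 0.1                 # hypothetical vessel diameter
ANISO = (0.14, 0.14, 0.5)       # hypothetical anisotropic voxel (mm)
ISO = (0.14, 0.14, 0.14)

for axis, name in enumerate("xyz"):
    print(name,
          round(blood_fraction_along_axis(VESSEL_MM, ANISO, axis), 3),
          round(blood_fraction_along_axis(VESSEL_MM, ISO, axis), 3))
```

      In the anisotropic case, a vessel aligned with the long voxel edge fills a larger fraction of the voxel than one running across it. With isotropic voxels the fraction is the same for all three orientations.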

      "In summary, numerous theoretical and practical considerations remain for optimal imaging of pial arteries using time-of-flight contrast. Depending on the application, advanced displacement artefact compensation strategies may be required, and zero-filling could provide better vessel depiction. Further, an optimal trade-off between SNR, voxel size and acquisition time needs to be found. Currently, the partial-volume FRE model only considers voxel size, and, as we reduced the voxel size in the experiments, we (partially) compensated for the reduction in SNR through longer scan times. This, ultimately, also required the use of prospective motion correction to enable the very long acquisition times necessary for 140 µm isotropic voxel size. Often, anisotropic voxels are used to reduce acquisition time and increase SNR while maintaining in-plane resolution. This may indeed prove advantageous when the (also highly anisotropic) arteries align with the anisotropic acquisition, e.g. when imaging the large supplying arteries oriented mostly in the head-foot direction. In the case of pial arteries, however, there is no preferred orientation because of the convoluted nature of the pial arterial vasculature encapsulating the complex folding of the cortex (see section Anatomical architecture of the pial arterial vasculature). A further reduction in voxel size may be possible in dedicated research settings utilizing even longer acquisition times and a larger field-of-view to maintain SNR. However, if acquisition time is limited, voxel size and SNR need to be carefully balanced against each other."

      Reviewer #2 (Public Review):

      Overview

      This paper explores the use of inflow contrast MRI for imaging the pial arteries. The paper begins by providing a thorough background description of pial arteries, including past studies investigating their velocity and diameter. Following this, the authors use this information to optimize the contrast between pial arteries and background tissue. This analysis reveals spatial resolution to be a strong factor influencing the contrast of the pial arteries. Finally, experiments are performed on a 7T MRI to investigate the effect of spatial resolution by acquiring images at multiple resolutions, to demonstrate the feasibility of acquiring ultrahigh-resolution 3D TOF, to examine the effect of displacement artifacts, and to assess the prospect of using T2* to remove venous voxels.

      Impression

      There is certainly interest in tools to improve our understanding of the architecture of the small vessels of the brain and this work does address this. The background description of the pial arteries is very complete and the manuscript is very well prepared. The images are also extremely impressive, likely benefiting from motion correction, 7T, and a very long scan time. The authors also commit to open science and provide the data in an open platform. Given this, I do feel the manuscript to be of value to the community; however, there are concerns with the methods for optimization, the qualitative nature of the experiments, and conclusions drawn from some of the experiments.

      Specific Comments:

      1) Figure 3 and Theory surrounding. The optimization shown in Figure 3 is based on fixing the flip angle or the TR. As is well described in the literature, there is a strong interdependency of flip angle and TR. This is all well described in literature dating back to the early 90s. While I think it reasonable to consider these effects in optimization, the language needs to include this interdependency or simply reference past work and specify how the flip angle was chosen. The human experiments do not include any investigation of flip angle or TR optimization.

      We thank the reviewer for raising this valuable point, and we fully agree that there is an interdependency between these two parameters. To simplify our optimization, we did fix one parameter value at a time, but in the revised manuscript we clarified that both parameters can be optimized simultaneously. Importantly, a large range of parameter values will result in a similar FRE in the small artery regime, which is illustrated in the optimization provided in the main text. We have therefore chosen the repetition time based on encoding efficiency and then set a corresponding excitation flip angle. In addition, we have also provided additional simulations in the supplementary material outlining the interdependency for the case of pial arteries.

      "Optimization of repetition time and excitation flip angle

      As the main goal of the optimisation here was to start within an already established parameter range for TOF imaging at ultra-high field (Kang et al., 2010; Stamm et al., 2013; von Morze et al., 2007), we only needed to then further tailor these for small arteries by considering a third parameter, namely the blood delivery time. From a practical perspective, a TR of 20 ms as a reference point was favourable, as it offered a time-efficient readout minimizing wait times between excitations but allowing low encoding bandwidths to maximize SNR. Due to the interdependency of flip angle and repetition time, for any one blood delivery time any FRE could (in theory) be achieved. For example, a similar FRE curve at 18 ° flip angle and 5 ms TR can also be achieved at 28 ° flip angle and 20 ms TR; or the FRE curve at 18 ° flip angle and 30 ms TR is comparable to the FRE curve at 8 ° flip angle and 5 ms TR (Supplementary Figure 3 TOP). In addition, the difference between optimal parameter settings diminishes for long blood delivery times, such that at a blood delivery time of 500 ms (Supplementary Figure 3 BOTTOM), the optimal flip angle at a TR of 15 ms, 20 ms or 25 ms would be 14 °, 16 ° and 18 °, respectively. This is in contrast to a blood delivery time of 100 ms, where the optimal flip angles would be 32 °, 37 ° and 41 °. In conclusion, in the regime of small arteries, long TR values in combination with low flip angles ensure flow-related enhancement at blood delivery times of 200 ms and above, and within this regime there are marginal gains by further optimizing parameter values and the optimal values are all similar."
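      To make the flip angle/TR interdependency described above concrete, the following is a minimal Python sketch of the textbook plug-flow inflow model for a spoiled gradient-echo sequence: blood enters the slab fully relaxed and receives one RF pulse per TR, while tissue sits in steady state. It is not necessarily identical to the authors' Eqs. (1) to (3). The T1 values (2100 ms for blood, 1950 ms for tissue), equal proton density, and treating the number of pulses as continuous are illustrative assumptions, so the printed optima will not reproduce the quoted values exactly.

```python
import math

# Assumed 7 T relaxation times (illustrative only, not taken from the study)
T1_BLOOD_MS = 2100.0
T1_TISSUE_MS = 1950.0


def steady_state_mz(tr_ms, flip_deg, t1_ms):
    """Steady-state longitudinal magnetization (units of M0) of stationary spins
    in a spoiled gradient-echo sequence, just before each RF pulse."""
    e1 = math.exp(-tr_ms / t1_ms)
    cos_a = math.cos(math.radians(flip_deg))
    return (1.0 - e1) / (1.0 - e1 * cos_a)


def inflow_mz(tr_ms, flip_deg, t1_ms, delivery_ms):
    """Longitudinal magnetization of plug-flow blood that entered the slab fully
    relaxed and has since received delivery_ms / tr_ms RF pulses
    (pulse count treated as continuous)."""
    e1 = math.exp(-tr_ms / t1_ms)
    cos_a = math.cos(math.radians(flip_deg))
    n_pulses = delivery_ms / tr_ms
    m_ss = steady_state_mz(tr_ms, flip_deg, t1_ms)
    # Exponential approach from M0 = 1 towards the blood steady state
    return m_ss + (1.0 - m_ss) * (e1 * cos_a) ** n_pulses


def relative_fre(tr_ms, flip_deg, delivery_ms):
    """Relative FRE = (blood - tissue) / tissue, assuming equal M0; the common
    sin(flip) readout factor cancels in the ratio."""
    blood = inflow_mz(tr_ms, flip_deg, T1_BLOOD_MS, delivery_ms)
    tissue = steady_state_mz(tr_ms, flip_deg, T1_TISSUE_MS)
    return (blood - tissue) / tissue


def optimal_flip_angle(tr_ms, delivery_ms, step_deg=0.5):
    """Grid search for the flip angle (0.5 to 90 degrees) maximizing relative FRE."""
    angles = [step_deg * i for i in range(1, int(90 / step_deg) + 1)]
    return max(angles, key=lambda a: relative_fre(tr_ms, a, delivery_ms))


for delivery in (100, 500):
    for tr in (15, 20, 25):
        best = optimal_flip_angle(tr, delivery)
        print(f"delivery {delivery} ms, TR {tr} ms: optimal flip angle ~{best:.1f} deg")
```

Under these assumptions the optimal flip angle drops markedly as the blood delivery time grows from 100 ms to 500 ms, which is the trend described in the text.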

      Supplementary Figure 3: Optimal imaging parameters for small arteries. This assessment follows the simulations presented in Figure 3, but in addition shows the interdependency for the corresponding third parameter (either flip angle or repetition time). TOP: Flip angles close to the Ernst angle show only a marginal flow-related enhancement; however, the influence of the blood delivery time decreases further (LEFT). As the flip angle increases well above the values used in this study, the flow-related enhancement in the small artery regime remains low even for the longer repetition times considered here (RIGHT). BOTTOM: The optimal excitation flip angle shows reduced variability across repetition times in the small artery regime compared to shorter blood delivery times.
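      The Ernst angle referred to in the caption, i.e. the flip angle that maximizes the steady-state signal of stationary tissue, follows from the standard relation θ_E = arccos(exp(-T_R/T1)). A short sketch; the tissue T1 of 1950 ms is an assumed, illustrative value rather than one reported in the study.

```python
import math


def ernst_angle_deg(tr_ms, t1_ms):
    """Flip angle maximizing the steady-state spoiled GRE signal: arccos(exp(-TR/T1))."""
    return math.degrees(math.acos(math.exp(-tr_ms / t1_ms)))


ASSUMED_T1_TISSUE_MS = 1950.0  # illustrative 7 T tissue T1 (assumption)

for tr in (5, 20, 30):
    print(f"TR {tr:2d} ms: Ernst angle ~{ernst_angle_deg(tr, ASSUMED_T1_TISSUE_MS):.1f} deg")
```

With this assumed T1, the Ernst angle at a TR of 20 ms is roughly 8°, well below the 18° flip angle used in the experiments.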

      "Based on these equations, optimal T_R and excitation flip angle values (θ) can be calculated for the blood delivery times under consideration (Figure 3). To better depict the regime of small arteries, we show the effect of either flip angle or T_R while keeping the other parameter fixed to the value that was ultimately used in the experiments, although both parameters can also be optimized simultaneously (Haacke et al., 1990). Supplementary Figure 3 further delineates the interdependency between flip angle and T_R within a parameter range commonly used for TOF imaging at ultra-high field (Kang et al., 2010; Stamm et al., 2013; von Morze et al., 2007). Note how longer T_R values still provide an FRE effect even at very long blood delivery times, whereas using shorter T_R values can suppress the FRE effect (Figure 3, left). Similarly, at lower flip angles the FRE effect is still present for long blood delivery times, but it disappears at larger flip angles, which, however, would give maximum FRE for shorter blood delivery times (Figure 3, right). Due to the non-linear relationships of both blood delivery time and flip angle with FRE, the optimal imaging parameters deviate considerably when comparing blood delivery times of 100 ms and 300 ms, but the differences between 300 ms and 1000 ms are less pronounced. In the following simulations and measurements, we have thus used a T_R value of 20 ms, i.e. a value only slightly longer than the readout of the high-resolution TOF acquisitions, which allowed time-efficient data acquisition, and a nominal excitation flip angle of 18°. From a practical standpoint, these values are also favorable as the low flip angle reduces the specific absorption rate (Fiedler et al., 2018) and the long T_R value decreases the potential for peripheral nerve stimulation (Mansfield and Harvey, 1993)."

      2) Figure 4 and Theory surrounding. A major limitation of this analysis is the lack of inclusion of noise. I believe it to be obvious that the FRE will be modulated by partial volume effects, here described quadratically by assuming the vessel to pass through the voxel. Including noise would substantially modify the analysis, with a shift towards larger voxel volumes (scan time being equal). The authors suggest FRE to be the dominant factor affecting segmentation; however, segmentation is limited by noise as much as by contrast.

      We of course agree with the reviewer that contrast-to-noise ratio is a key factor that determines the detection of vessels and the quality of the segmentation, however there are subtleties regarding the exact inter-relationship between CNR, resolution, and segmentation performance.

      The main purpose of Figure 4 is not to provide a trade-off between flow-related enhancement and signal-to-noise ratio—in particular as SNR is modulated by many more factors than voxel size alone, e.g. acquisition time, coil geometry and instrumentation—but to decide whether the limiting factor for imaging pial arteries is the reduction in flow-related enhancement due to long blood delivery times (which is the explanation often found in the literature (Chen et al., 2018; Haacke et al., 1990; Masaryk et al., 1989; Mut et al., 2014; Park et al., 2020; Parker et al., 1991; Wilms et al., 2001; Wright et al., 2013)) or due to partial volume effects. Furthermore, when reducing voxel size one will also likely increase the number of encoding steps to maintain the imaging coverage (i.e., the field-of-view) and so the relationship between voxel size and SNR in practice is not straightforward. Therefore, we had to conclude that deducing a meaningful SNR analysis that would benefit the reader was not possible given the available data due to the complex relationship between voxel size and other imaging parameters. Note that these considerations are not specific to imaging the pial arteries but apply to all MRA acquisitions, and have thus been discussed previously in the literature. Here, we wanted to focus on the novel insights gained in our study, namely that it provides an expression for how relative FRE contrast changes with voxel size with some assumptions that apply for imaging pial arteries.
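      The point that SNR does not simply scale with voxel volume can be illustrated with the textbook proportionality SNR ∝ voxel volume × sqrt(total ADC sampling time) for 3D Cartesian imaging. The sketch below assumes a fixed field of view, slab thickness and per-pixel bandwidth (hypothetical default values that cancel in the ratio) and ignores coil geometry, acceleration, T2* decay and motion.

```python
import math


def relative_snr(voxel_mm, fov_mm=192.0, slab_mm=20.0, bw_per_pixel_hz=100.0, averages=1):
    """Relative SNR (arbitrary units) of a 3D Cartesian acquisition using the
    textbook proportionality SNR ~ voxel volume * sqrt(total ADC sampling time).
    Default FOV, slab thickness and bandwidth are hypothetical placeholders."""
    n_phase = fov_mm / voxel_mm        # in-plane phase-encoding steps
    n_partition = slab_mm / voxel_mm   # partition (slab) encoding steps
    # Each readout lasts 1 / (bandwidth per pixel) seconds
    adc_time_s = n_phase * n_partition * averages / bw_per_pixel_hz
    return voxel_mm ** 3 * math.sqrt(adc_time_s)


coarse, fine = 0.8, 0.3
volume_only = (fine / coarse) ** 3
with_encoding = relative_snr(fine) / relative_snr(coarse)
print(f"SNR ratio if only voxel volume mattered: {volume_only:.3f}")
print(f"SNR ratio with the extra encoding steps:  {with_encoding:.3f}")
```

In this idealized case, moving from 0.8 mm to 0.3 mm isotropic voxels shrinks the voxel volume about 19-fold, but the additional encoding steps lengthen the sampling time, so the predicted SNR drops only to (0.3/0.8)² ≈ 0.14, at the cost of a correspondingly longer scan.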

      Further, depending on the definition of FRE and whether partial-volume effects are included (see also our response to R2.8), larger voxel volumes have been found to be theoretically advantageous even when only considering contrast (Du et al., 1996; Venkatesan and Haacke, 1997), which is not in line with empirical observations (Al-Kwifi et al., 2002; Bouvy et al., 2014; Haacke et al., 1990; Ladd, 2007; Mattern et al., 2018; von Morze et al., 2007).

      The notion that vessel segmentation algorithms perform well on noisy data but poorly on low-contrast data was mainly driven by our own experiences. However, we still believe that the assumption that (all) segmentation algorithms depend linearly on contrast and noise (which the formulation of a contrast-to-noise ratio presumes) is similarly not warranted. Indeed, the necessary trade-off between FRE and SNR might be more a property of the particular segmentation algorithm being used than a general property of the acquisition. Please also note that our analysis of the FRE does not suggest that an arbitrarily high resolution is needed. Importantly, while we previously noted that reducing voxel size improves contrast in vessels whose diameters are smaller than the voxel size, we now explicitly acknowledge that, for vessels whose diameters are larger than the voxel size, reducing the voxel size is not helpful (since it only reduces SNR without any gain in contrast) and may hinder segmentation performance, thus becoming counterproductive. But we take the reviewer’s point and also acknowledge that these intricacies need to be mentioned, and therefore we have rephrased the statement in the discussion in the following way:

      "In general, we have not considered SNR, but only FRE, i.e. the (relative) image contrast, assuming that segmentation algorithms would benefit from higher contrast for smaller arteries. Importantly, the acquisition parameters available to maximize FRE are limited, namely repetition time, flip angle and voxel size. SNR, however, can be improved via numerous avenues independent of these parameters (Brown et al., 2014b; Du et al., 1996; Heverhagen et al., 2008; Parker et al., 1991; Triantafyllou et al., 2011; Venkatesan and Haacke, 1997), the simplest being longer acquisition times. If the aim is to optimize a segmentation outcome for a given acquisition time, the trade-off between contrast and SNR for the specific segmentation algorithm needs to be determined (Klepaczko et al., 2016; Lesage et al., 2009; Moccia et al., 2018; Phellan and Forkert, 2017). Our own—albeit limited—experience has shown that segmentation algorithms (including manual segmentation) can accommodate a perhaps surprising amount of noise using prior knowledge and neighborhood information, making these high-resolution acquisitions possible. Importantly, note that our treatment of the FRE does not suggest that an arbitrarily small voxel size is needed, but instead that voxel sizes appropriate for the arterial diameter of interest are beneficial (in line with the classic “matched-filter” rationale (North, 1963)). Voxels smaller than the arterial diameter would not yield substantial benefits (Figure 5) and may result in SNR reductions that would hinder segmentation performance."

      3) Page 11, Line 225. "only a fraction of the blood is replaced" I think the language should be reworded. There are certainly water molecules in blood which have experienced more excitation B1 pulses due to the parabolic flow upstream and the temporal variation in flow. There is magnetization diffusion which reduces the discrepancy; however, it seems pertinent to just say the authors assume the signal is represented by the average arrival time. This analysis is never verified and is only approximate anyway. The "blood dwell time" is also an average, since voxels near the wall will travel more slowly. Overall, I recommend reducing the conjecture in this section.

      We fully agree that our treatment of the blood dwell time does not account for the much more complex flow patterns found in cortical arteries. However, our aim was not to comment on these complex patterns, but to help establish whether, in the simplest scenario assuming plug flow, the often-mentioned slow blood flow requires multiple velocity compartments to describe the FRE (as is commonly done for 2D MRA (Brown et al., 2014a; Carr and Carroll, 2012)). We did not intend to comment on the effects of laminar flow or even more complex flow patterns, which would require a more in-depth treatment. However, as the small arteries targeted here are often just one voxel thick, all signals are indeed integrated within that voxel (i.e. there is no voxel near the wall that travels more slowly), which may average out more complex effects. We have clarified the purpose and scope of this section in the following way:

      "In classical descriptions of the FRE effect (Brown et al., 2014a; Carr and Carroll, 2012), significant emphasis is placed on the effect of multiple “velocity segments” within a slice in the 2D imaging case. Using the simplified plug-flow model, where the cross-sectional profile of blood velocity within the vessel is constant and effects such as drag along the vessel wall are not considered, these segments can be described as ‘disks’ of blood that do not completely traverse the full slice within one T_R, and, thus, only a fraction of the blood in the slice is replaced. Consequently, estimation of the FRE effect would then need to accommodate contributions from multiple ‘disks’ that have experienced 1 to k RF pulses. In the case of 3D imaging as employed here, multiple velocity segments within one voxel are generally not considered, as the voxel sizes in 3D are often smaller than the slice thickness in 2D imaging and it is assumed that the blood completely traverses a voxel each T_R. However, the question arises whether this assumption holds for pial arteries, where blood velocity is considerably lower than in the large intracranial arteries (Figure 2). To answer this question, we have computed the blood dwell time, i.e. the average time it takes the blood to traverse a voxel, as a function of blood velocity and voxel size (Figure 2). For reference, the blood velocity estimates from the three studies mentioned above (Bouvy et al., 2016; Kobari et al., 1984; Nagaoka and Yoshida, 2006) have been added in this plot as horizontal white lines. For the voxel sizes of interest here, i.e. 50–300 μm, blood dwell times are, for all but the slowest flows, well below commonly used repetition times (Brown et al., 2014a; Carr and Carroll, 2012; Ladd, 2007; von Morze et al., 2007).
Thus, in a first approximation using the plug-flow model, and contrary to what one might expect from classical treatments, it is not necessary to include several velocity segments when considering pial arteries at the voxel sizes of interest, and the FRE effect can be described by equations (1) – (3), simplifying our characterization of FRE for these vessels. When considering the effect of more complex flow patterns, it is important to bear in mind that the arteries targeted here are only one voxel thick, and signals are integrated across the whole artery."
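      In the plug-flow model, the blood dwell time is simply the voxel size divided by the blood velocity. The minimal sketch below compares it with the 20 ms T_R used in the study; the velocities listed are illustrative placeholders, not the literature values plotted in Figure 2.

```python
TR_MS = 20.0  # repetition time used in the study


def dwell_time_ms(voxel_um, velocity_mm_s):
    """Plug-flow dwell time for blood crossing one voxel edge.
    Unit identity: um / (mm/s) = 1e-3 mm / (mm/s) = ms."""
    return voxel_um / velocity_mm_s


voxel_sizes_um = [50, 100, 160, 300]
velocities_mm_s = [5, 10, 20, 50]  # illustrative placeholders only

for v in velocities_mm_s:
    cells = []
    for d in voxel_sizes_um:
        t = dwell_time_ms(d, v)
        flag = "*" if t > TR_MS else " "  # * marks a dwell time longer than TR
        cells.append(f"{t:6.1f}{flag}")
    print(f"{v:3d} mm/s: " + " ".join(cells))
```

Entries flagged with `*` are the combinations for which the assumption that blood fully traverses a voxel within one T_R would break down.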

      4) Page 13, Line 260. "two-compartment modelling" I think this section is better labeled "Extension to consider partial volume effects" The compartments are not interacting in any sense in this work.

      Thank you for this suggestion. We have replaced the heading with "Introducing a partial-volume model" (page 14) and replaced all instances of ‘two-compartment model’ with ‘partial-volume model’.

      5) Page 14, Line 284. "In practice, a reduction in slab …." "reducing the voxel size is a much more promising avenue" There is a fair amount of conjecture here which is not supported by experiments. While this may be true, the authors also use a classical approach with quite thin slabs.

      The slab thickness used in our experiments was mainly limited by the acquisition time and the participants' ability to lie still. We did perform one measurement with a very experienced participant using a thicker slab, but found that with over 20 minutes of acquisition time, motion artefacts were unavoidable. The data presented in Figure 5 were acquired with similar slab thickness, supporting the statement that reducing the voxel size is a promising avenue for imaging small pial arteries. However, we indeed have not provided an empirical comparison of the effect of slab thickness. Nevertheless, we believe it remains useful to make the theoretical argument that, due to the convoluted nature of the pial arterial vascular geometry, a reduction in slab thickness may not reduce the blood delivery time if no reduction in intra-slab vessel length can be achieved, i.e. if the majority of the artery is still contained in the smaller slab. We have clarified the statement and removed the direct comparison (‘much more’ promising) in the following way:

      "In theory, a reduction in blood delivery time increases the FRE in both regimes, and—if the vessel is smaller than the voxel—so would a reduction in voxel size. In practice, a reduction in slab thickness―which is the default strategy in classical TOF-MRA to reduce blood delivery time―might not provide substantial FRE increases for pial arteries. This is due to their convoluted geometry (see section Anatomical architecture of the pial arterial vasculature), where a reduction in slab thickness may not necessarily reduce the vessel segment length if the majority of the artery is still contained within the smaller slab. Thus, given the small arterial diameter, reducing the voxel size is a promising avenue when imaging the pial arterial vasculature."

      6) Figure 5. These image differences are highly exaggerated by the lack of zero filling (or any interpolation) and the fact that the wildly different. The interpolation should be addressed, and the scan time discrepancy listed as a limitation.

      We have extended the discussion around zero-filling by including additional considerations based on the imaging parameters in Figure 5 and highlighted the substantial differences in voxel volume. Our choice not to perform zero-filling was driven by the open question of what an ‘optimal’ zero-filling factor would be. We have also highlighted the substantial differences in acquisition time when describing the results.

      Changes made to the results section:

      "To investigate the effect of voxel size on vessel FRE, we acquired data at four different voxel sizes ranging from 0.8 mm to 0.3 mm isotropic resolution, adjusting only the encoding matrix, with imaging parameters being otherwise identical (FOV, TR, TE, flip angle, R, slab thickness, see section Data acquisition). The total acquisition time increases from less than 2 minutes for the lowest resolution scan to over 6 minutes for the highest resolution scan as a result."

      Changes made to the discussion section:

      "Nevertheless, slight qualitative improvements in image appearance have been reported for higher zero-filling factors (Du et al., 1994), presumably owing to a smoother representation of the vessels (Bartholdi and Ernst, 1973). In contrast, Mattern et al. (2018) reported no improvement in vessel contrast for their high-resolution data. Ultimately, for each application, e.g. visual evaluation vs. automatic segmentation, the optimal zero-filling factor needs to be determined, balancing image appearance (Du et al., 1994; Zhu et al., 2013) with loss in statistical independence of the image noise across voxels. For example, in Figure 5, when comparing across different voxel sizes, the visual impression might improve with zero-filling. However, it remains unclear whether the same zero-filling factor should be applied for each voxel size; if it were, the overall difference in resolution would remain, namely a nearly 20-fold reduction in voxel volume when moving from 0.8-mm isotropic to 0.3-mm isotropic voxel size. Alternatively, the same ‘zero-filled’ voxel size could be used for evaluation, although then more than 94 % of the samples used to reconstruct the 0.8-mm data at 0.3-mm isotropic resolution would be zero-valued. Consequently, all data presented in this study were reconstructed without zero-filling."
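      The two numbers in the paragraph above follow from simple 3D arithmetic on isotropic voxels at a fixed field of view. A sketch, assuming zero-filling simply pads the acquired k-space to the matrix size of the finer resolution:

```python
def volume_ratio(coarse_mm, fine_mm):
    """Factor by which the isotropic voxel volume shrinks from coarse to fine."""
    return (coarse_mm / fine_mm) ** 3


def zero_filled_fraction(acquired_mm, target_mm):
    """Fraction of k-space samples that are zeros when 3D data acquired at
    acquired_mm are zero-filled to the matrix of target_mm (same FOV)."""
    return 1.0 - (target_mm / acquired_mm) ** 3


print(f"voxel-volume ratio 0.8 mm -> 0.3 mm: {volume_ratio(0.8, 0.3):.1f}")
print(f"zero-valued fraction: {zero_filled_fraction(0.8, 0.3):.3f}")
```

This gives a volume ratio of about 19 and a zero-valued fraction of about 0.95.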

      7) Figure 7. Given the limited nature of experiment may it not also be possible the subject moved more, had differing brain blood flow, etc. Were these lengthy scans acquired in the same session? Many of these differences could be attributed to other differences than the small difference in spatial resolution.

      The scans were acquired in the same session using the same prospective motion correction procedure. Note that the acquisition time of the images with 0.16 mm isotropic voxel size was comparatively short, taking just under 12 minutes. Although the difference in spatial resolution may seem small, it still amounts to a 33% reduction in voxel volume. For comparison, reducing the voxel size from 0.4 mm to 0.3 mm also ‘only’ reduces the voxel volume by 58 %—not even twice as much. Overall, we fully agree that additional validation and optimisation of the imaging parameters for pial arteries are beneficial and have added a corresponding statement to the Discussion section.
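      The voxel-volume reductions quoted above can be checked with a one-line calculation for isotropic voxels:

```python
def volume_reduction_pct(old_mm, new_mm):
    """Percentage reduction of isotropic voxel volume from old_mm to new_mm."""
    return 100.0 * (1.0 - (new_mm / old_mm) ** 3)


small_step = volume_reduction_pct(0.16, 0.14)
large_step = volume_reduction_pct(0.4, 0.3)
print(f"0.16 -> 0.14 mm: {small_step:.1f} % smaller voxel volume")
print(f"0.40 -> 0.30 mm: {large_step:.1f} % smaller voxel volume")
print(f"ratio of the two reductions: {large_step / small_step:.2f}")
```

The two reductions come out at about 33.0 % and 57.8 %, a ratio of roughly 1.75.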

      Changes made to the results section (also in response to Reviewer 1 (R1.22))

      "We have also acquired one single slab with an isotropic voxel size of 0.16 mm with prospective motion correction for this participant in the same session to compare to the acquisition with 0.14 mm isotropic voxel size and to test whether any gains in FRE are still possible at this level of the vascular tree."

      Changes made to the discussion section:

      "Acquiring these data at even higher field strengths would boost SNR (Edelstein et al., 1986; Pohmann et al., 2016) to partially compensate for SNR losses due to acceleration and may enable faster imaging and/or smaller voxel sizes. This could facilitate the identification of the ultimate limit of the flow-related enhancement effect and identify at which stage of the vascular tree the blood delivery time becomes the limiting factor. While Figure 7 indicates the potential for voxel sizes below 0.16 mm, the singular nature of this comparison warrants further investigation."

      8) Page 22, Line 395. Would the analysis be any different with an absolute difference? The FRE (Eq 6) divides by a constant value. Clearly there is value in the difference as other subtractive inflow imaging would have infinite FRE (not considering noise as the authors do).

      Absolutely; using an absolute FRE would result in the highest FRE for the largest voxel size, whereas in our data small vessels are more easily detected with the smallest voxel size. We also note that relative FRE would indeed become infinite if the value in the denominator representing the tissue signal were zero, but this special case highlights how relative FRE can help characterize “segmentability”: a vessel with any intensity surrounded by tissue with an intensity of zero is trivially/infinitely segmentable. We have added this point to the revised manuscript as indicated below.

      Following the suggestion of Reviewer 1 (R1.2), we have included additional simulations to clarify the effects of relative FRE definition and partial-volume model, in which we show that only when considering both together are smaller voxel sizes advantageous (Supplementary Material).

      "Effect of FRE Definition and Interaction with Partial-Volume Model

      For the definition of the FRE effect in this study, we used a measure of relative FRE (Al-Kwifi et al., 2002) in combination with a partial-volume model (Eq. 6). To illustrate the effect of these two definitions, as well as their interaction, we have estimated the relative and absolute FRE for an artery with a diameter of 200 µm and 2 000 µm (i.e. no partial-volume effects). The absolute FRE explicitly takes the voxel volume into account, i.e. instead of Eq. (6) for the relative FRE we used"

      Eq. (1)

      Note that the division by

      to obtain the relative FRE removes the contribution of the total voxel volume

      "Supplementary Figure 2 shows that, when partial volume effects are present, the highest relative FRE arises in voxels with the same size as or smaller than the vessel diameter (Supplementary Figure 2A), whereas the absolute FRE increases with voxel size (Supplementary Figure 2C). If no partial-volume effects are present, the relative FRE becomes independent of voxel size (Supplementary Figure 2B), whereas the absolute FRE increases with voxel size (Supplementary Figure 2D). While the partial-volume effects for the relative FRE are substantial, they are much more subtle when using the absolute FRE and do not alter the overall characteristics."

      Supplementary Figure 2: Effect of voxel size and blood delivery time on the flow-related enhancement (FRE) using either a relative (A,B) (Eq. (3)) or an absolute (C,D) (Eq. (12)) FRE definition, assuming a pial artery diameter of 200 μm (A,C) or 2 000 µm (B,D), i.e. no partial-volume effects at the central voxel of this artery considered here.
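      A minimal sketch of how relative and absolute FRE behave with voxel size under a partial-volume model. It assumes (i) a cylindrical vessel running through the voxel centre along one voxel edge, so that the blood fraction equals the area of a centred circle inside the square cross-section; (ii) a fixed FRE for pure blood, ignoring the dependence on blood delivery time; and (iii) absolute FRE defined as the magnetization difference scaled by voxel volume. The exact circle-square intersection, the constants and these definitions are illustrative assumptions; the authors' Eq. (12) and Supplementary Material may differ.

```python
import math


def vessel_fraction(d_um, l_um):
    """Blood-volume fraction of a cylinder (diameter d_um) running through the
    centre of an isotropic voxel (edge l_um) along one edge: the area of the
    centred circle that lies inside the square cross-section, over l_um**2."""
    r, a = d_um / 2.0, l_um / 2.0
    if r <= a:                      # vessel fully inside the voxel
        area = math.pi * r * r
    elif r >= a * math.sqrt(2.0):   # voxel fully inside the vessel
        area = 4.0 * a * a
    else:                           # subtract the four segments outside the square
        segment = r * r * math.acos(a / r) - a * math.sqrt(r * r - a * a)
        area = math.pi * r * r - 4.0 * segment
    return area / (l_um * l_um)


FRE_NO_PV = 1.0  # hypothetical (M_blood - M_tissue) / M_tissue for pure blood
M_TISSUE = 1.0   # hypothetical tissue magnetization per unit volume


def relative_fre_pv(d_um, l_um):
    """Relative FRE with partial volume: V_blood * (M_blood - M_tissue) / M_tissue."""
    return vessel_fraction(d_um, l_um) * FRE_NO_PV


def absolute_fre_pv(d_um, l_um):
    """Assumed absolute FRE: magnetization difference scaled by voxel volume (um^3)."""
    return l_um ** 3 * vessel_fraction(d_um, l_um) * FRE_NO_PV * M_TISSUE


for diameter in (200, 2000):
    for voxel in (100, 200, 400, 800):
        print(f"d={diameter:4d} um, voxel={voxel:3d} um: "
              f"relative {relative_fre_pv(diameter, voxel):.3f}, "
              f"absolute {absolute_fre_pv(diameter, voxel):.3g}")
```

For the 200 µm vessel the relative FRE falls once the voxel grows beyond the vessel, while the absolute FRE keeps rising; for the 2 000 µm vessel the relative FRE stays constant. This qualitatively mirrors the trends described for Supplementary Figure 2.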

      Following the established literature (Brown et al., 2014a; Carr and Carroll, 2012; Haacke et al., 1990) and because we would ultimately derive a relative measure, we omitted the effect of voxel volume on the longitudinal magnetization in our derivations. This makes it appear as if we are dividing by a constant in Eq. 6, as the effect of total voxel volume cancels out for the relative FRE. We have now made this more explicit in our derivation of the partial-volume model.

      "Introducing a partial-volume model

      To account for the effect of voxel volume on the FRE, the total longitudinal magnetization M_z also needs to consider the number of spins contained within a voxel (Du et al., 1996; Venkatesan and Haacke, 1997). A simple approximation can be obtained by scaling the longitudinal magnetization with the voxel volume (Venkatesan and Haacke, 1997). To then include partial volume effects, the total longitudinal magnetization in a voxel M_z^total becomes the sum of the contributions from the stationary tissue M_zS^tissue and the inflowing blood M_z^blood, weighted by their respective volume fractions V_rel:"


      Eq. (4)

      For simplicity, we assume that a single vessel is located at the center of the voxel and approximate it as a cylinder with diameter d_vessel and a length equal to the side length l_voxel of an assumed isotropic voxel. The relative volume fraction of blood V_rel^blood is the ratio of the vessel volume within the voxel to the total voxel volume (see section Estimation of vessel-volume fraction in the Supplementary Material), and the tissue volume fraction V_rel^tissue is the remainder that is not filled with blood, or

      $$V_{\mathrm{rel}}^{\mathrm{tissue}} = 1 - V_{\mathrm{rel}}^{\mathrm{blood}} \tag{5}$$

      We can now replace the blood magnetization in Eq. (3) with the total longitudinal magnetization of the voxel to compute the FRE as a function of the vessel-volume fraction:

      Eq. (6)

      Based on your suggestion, we have also extended our interpretation of relative and absolute FRE. Indeed, a subtractive flow technique where no signal in the background remains and only intensities in the object are present would have infinite relative FRE, as this basically constitutes a perfect segmentation (bar a simple thresholding step).

      "Extending classical FRE treatments to the pial vasculature

      There are several major modifications in our approach to this topic that might explain why, in contrast to predictions from classical FRE treatments, it is indeed possible to image pial arteries. For instance, the definition of vessel contrast or flow-related enhancement is often stated as an absolute difference between blood and tissue signal (Brown et al., 2014a; Carr and Carroll, 2012; Du et al., 1993, 1996; Haacke et al., 1990; Venkatesan and Haacke, 1997). Here, however, we follow the approach of Al-Kwifi et al. (2002) and consider relative contrast. While this distinction may seem to be semantic, the effect of voxel volume on FRE for these two definitions is exactly opposite: Du et al. (1996) concluded that larger voxel size increases the (absolute) vessel-background contrast, whereas here we predict an increase in relative FRE for small arteries with decreasing voxel size. Therefore, predictions of the depiction of small arteries with decreasing voxel size differ depending on whether one is considering absolute contrast, i.e. difference in longitudinal magnetization, or relative contrast, i.e. contrast differences independent of total voxel size. Importantly, this prediction changes for large arteries where the voxel contains only vessel lumen, in which case the relative FRE remains constant across voxel sizes, but the absolute FRE increases with voxel size (Supplementary Figure 9). Overall, the interpretations of relative and absolute FRE differ, and one measure may be more appropriate for certain applications than the other. Absolute FRE describes the difference in magnetization and is thus tightly linked to the underlying physical mechanism. Relative FRE, however, describes the image contrast and segmentability. If blood and tissue magnetization are equal, both contrast measures would equal zero and indicate that no contrast difference is present. 
However, when there is signal in the vessel and as the tissue magnetization approaches zero, the absolute FRE approaches the blood magnetization (assuming no partial-volume effects), whereas the relative FRE approaches infinity. This infinite relative FRE does not correspond to a physical ‘infinite’ signal enhancement through inflowing blood; instead, it characterizes the segmentability of the image, in that an image with zero intensity in the background and non-zero values in the structures of interest can be segmented perfectly and trivially. Accordingly, numerous empirical observations (Al-Kwifi et al., 2002; Bouvy et al., 2014; Haacke et al., 1990; Ladd, 2007; Mattern et al., 2018; von Morze et al., 2007) and the data provided here (Figures 5, 6 and 7) have shown the benefit of smaller voxel sizes if the aim is to visualize and segment small arteries."
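      The limiting behaviour described above can be reproduced by applying the two contrast definitions to a single hypothetical blood magnetization while driving the tissue magnetization towards zero (no partial-volume effects, illustrative values only):

```python
import math


def relative_fre(m_blood, m_tissue):
    """Relative contrast (m_blood - m_tissue) / m_tissue; infinite if tissue is zero."""
    if m_tissue == 0:
        return math.inf if m_blood > 0 else 0.0
    return (m_blood - m_tissue) / m_tissue


def absolute_fre(m_blood, m_tissue):
    """Absolute contrast: the plain magnetization difference."""
    return m_blood - m_tissue


M_BLOOD = 0.4  # hypothetical blood magnetization (units of M0)

for m_tissue in (0.4, 0.2, 0.05, 0.01, 0.0):
    print(f"tissue {m_tissue:4.2f}: absolute {absolute_fre(M_BLOOD, m_tissue):.2f}, "
          f"relative {relative_fre(M_BLOOD, m_tissue):.1f}")
```

Equal magnetizations give zero for both measures; as the tissue signal vanishes, the absolute FRE approaches the blood magnetization while the relative FRE diverges.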

      9) Page 22, Line 400. "The appropriateness of " This also ignores noise. The absolute enhancement is the inherent magnetization available. The results in Figures 5, 6 and 7 don't readily support a ratio over an absolute difference accounting for partial volume effects.

      We hope that the additional explanations of the effect of the relative FRE definition in combination with a partial-volume model, and of the interpretation of relative FRE provided in the previous response (R2.8), together with Figures 5, 6 and 7 showing smaller arteries for smaller voxels, clarify our argument for why only relative FRE in combination with a partial-volume model can explain why smaller voxel sizes are advantageous for depicting small arteries.

      While we appreciate that there exists a fundamental relationship between SNR and voxel volume in MR (Brown et al., 2014b), this relationship is also modulated by many more factors (as we have argued in our responses to R2.2 and R1.4b).

      We hope that the additional derivations and simulations provided in the previous response have clarified why a relative FRE model in combination with a partial-volume model helps to explain the enhanced detectability of small vessels with small voxels.

      10) Page 24, Line 453. "strategies, such as radial and spiral acquisitions, experience no vessel displacement artefact" These do observe flow related distortions as well, just not typically called displacement.

      Yes, this is a helpful point, as these methods will also experience a degradation of spatial accuracy due to flow effects, which will propagate into errors in the segmentation.

      As the reviewer suggests, flow-related artefacts in radial and spiral acquisitions usually manifest as a slight blur, and less as the prominent displacement found in Cartesian sampling schemes. We have added a corresponding clarification to the Discussion section:

      "Other encoding strategies, such as radial and spiral acquisitions, experience no vessel displacement artefact because phase and frequency encoding take place in the same instant, although a slight blur might be observed instead (Nishimura et al., 1995, 1991). However, both trajectories pose engineering challenges and place much higher demands on hardware and reconstruction algorithms than the Cartesian readouts employed here (Kasper et al., 2018; Shu et al., 2016), particularly for achieving 3D acquisitions with 160 µm isotropic resolution."
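      The origin of the Cartesian displacement artefact can be sketched with a first-order, constant-velocity model: a spin's phase-encoded position is fixed when the phase-encoding gradient is applied, while its frequency-encoded position is fixed at the echo centre, so a vessel appears shifted by roughly v × Δt, with v the velocity component along the phase-encoding axis and Δt the delay between the two encodings. The velocity and delay below are hypothetical illustrative values, not parameters from this study; only the 0.16 mm voxel size is taken from the text. Setting Δt = 0 mimics the simultaneous encoding of radial and spiral readouts, which removes the shift in this simple model (the residual blur is not modelled).

```python
def displacement_mm(velocity_mm_per_s: float, encoding_delay_s: float) -> float:
    """First-order apparent shift of flowing blood in a Cartesian image.

    Assumes constant velocity and instantaneous encodings: the phase-encoded
    coordinate is sampled at the phase-encoding gradient and the frequency-encoded
    coordinate at the echo centre, so the shift is velocity * delay.
    """
    return velocity_mm_per_s * encoding_delay_s


VOXEL_MM = 0.16            # isotropic voxel size quoted in the text
VELOCITY_MM_PER_S = 200.0  # hypothetical blood velocity component along phase encoding
DELAY_S = 0.003            # hypothetical delay from phase encoding to echo centre

for label, delay in (("Cartesian", DELAY_S), ("simultaneous (radial/spiral)", 0.0)):
    shift = displacement_mm(VELOCITY_MM_PER_S, delay)
    print(f"{label}: {shift:.2f} mm = {shift / VOXEL_MM:.1f} voxels")
```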

      11) Page 24, Line 272. "although even with this nearly ideal subject behaviour approximately 1 in 4 scans still had to be discarded and repeated" This is certainly a potential source of bias in the comparisons.

      We apologize if this section was written in a misleading way. For the comparison presented in Figure 7, we acquired one additional slab at 0.16 mm voxel size in the same session, using the same prospective motion correction procedure as for the 0.14 mm data. For the images at 0.16 mm voxel size shown in Figure 6 and Supplementary Figure 4, we did not use a motion correction system and, thus, had to discard a portion of the data. We have clarified in the Discussion section that prospective motion correction was used at both resolutions for the comparison of the high-resolution data:

      "This allowed for the successful correction of head motion of approximately 1 mm over the 60-minute scan session, showing the utility of prospective motion correction at these very high resolutions. Note that for the comparison in Figure 7, one slab with 0.16 mm voxel size was acquired in the same session also using the prospective motion correction system. However, for the data shown in Figure 6 and Supplementary Figure 4, no prospective motion correction was used, and we instead relied on the experienced participants who contributed to this study. We found that the acquisition of TOF data with 0.16 mm isotropic voxel size in under 12 minutes acquisition time per slab is possible without discernible motion artifacts, although even with this nearly ideal subject behaviour approximately 1 in 4 scans still had to be discarded and repeated."
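      The scan-time cost of discarding approximately 1 in 4 scans can be estimated with a short calculation. This is a minimal sketch assuming that every attempt, including repeats, is independently discarded with the same probability (a geometric model, which the text does not state); the 12-minute slab duration is the upper bound quoted above.

```python
def expected_acquisitions(discard_fraction: float) -> float:
    """Expected number of acquisitions per usable slab when each attempt is
    independently discarded with probability `discard_fraction` (geometric model)."""
    if not 0.0 <= discard_fraction < 1.0:
        raise ValueError("discard_fraction must be in [0, 1)")
    return 1.0 / (1.0 - discard_fraction)


SLAB_MINUTES = 12.0      # upper bound per slab quoted in the text
DISCARD_FRACTION = 0.25  # "approximately 1 in 4 scans"

n = expected_acquisitions(DISCARD_FRACTION)
print(f"{n:.2f} acquisitions, about {n * SLAB_MINUTES:.0f} min per usable slab")
```

      Under this model, about 1.33 acquisitions, or roughly 16 minutes of scanning, are needed per usable slab without motion correction.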

      12) Page 25, Line 489. "then need to include the effects of various analog and digital filters" While the analysis may benefit from some of this, most is not at all required for analysis based on optimization of the imaging parameters.

      We have included all four correction factors for completeness, given the unusual region of acquisition-parameter and contrast space that our time-of-flight acquisition occupies, e.g. a very low bandwidth of only 100 Hz, very large matrix sizes (> 1024 samples), and ideally zero SNR in the background (fully suppressed tissue signal). However, we agree that the most important factor is probably the non-central chi distribution of the noise in magnitude images from multi-channel coil arrays, and have added this qualification to the text:

      "Accordingly, SNR predictions then need to include the effects of various analog and digital filters, the number of acquired samples, the noise covariance correction factor, and—most importantly—the non-central chi distribution of the noise statistics of the final magnitude image (Triantafyllou et al., 2011)."
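      The effect of the magnitude-noise distribution can be illustrated with a short simulation. In background voxels with zero true signal, the non-central chi distribution reduces to a central chi distribution with 2N degrees of freedom for a root-sum-of-squares combination of N channels, so the background has a non-zero mean that grows with N. This sketch assumes N uncorrelated channels with equal Gaussian noise level σ in the real and imaginary parts (i.e. it ignores the noise covariance correction) and compares the simulated mean with the closed form σ√2 Γ(N + ½)/Γ(N).

```python
import math
import random


def rss_background_samples(n_channels: int, sigma: float, n_samples: int,
                           seed: int = 0) -> list[float]:
    """Root-sum-of-squares magnitude of pure-noise voxels (zero true signal),
    assuming uncorrelated channels with equal complex Gaussian noise."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        power = 0.0
        for _ in range(n_channels):
            re = rng.gauss(0.0, sigma)
            im = rng.gauss(0.0, sigma)
            power += re * re + im * im
        samples.append(math.sqrt(power))
    return samples


def central_chi_mean(n_channels: int, sigma: float) -> float:
    """Mean of a sigma-scaled central chi distribution with 2N degrees of freedom."""
    log_ratio = math.lgamma(n_channels + 0.5) - math.lgamma(n_channels)
    return sigma * math.sqrt(2.0) * math.exp(log_ratio)


for n in (1, 8, 32):
    sims = rss_background_samples(n, sigma=1.0, n_samples=5000)
    simulated = sum(sims) / len(sims)
    print(f"N={n:2d}: simulated mean {simulated:.3f}, theory {central_chi_mean(n, 1.0):.3f}")
```

      For N = 1 the closed form reduces to the Rayleigh mean σ√(π/2); the growing background mean for larger N is one reason why magnitude-image SNR cannot be read off as a simple signal-over-standard-deviation ratio.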

      Al-Kwifi, O., Emery, D.J., Wilman, A.H., 2002. Vessel contrast at three Tesla in time-of-flight magnetic resonance angiography of the intracranial and carotid arteries. Magnetic Resonance Imaging 20, 181–187. https://doi.org/10.1016/S0730-725X(02)00486-1

      Arts, T., Meijs, T.A., Grotenhuis, H., Voskuil, M., Siero, J., Biessels, G.J., Zwanenburg, J., 2021. Velocity and Pulsatility Measures in the Perforating Arteries of the Basal Ganglia at 3T MRI in Reference to 7T MRI. Frontiers in Neuroscience 15.

      Avants, B.B., Tustison, N., Song, G., 2009. Advanced normalization tools (ANTS). Insight J 2, 1–35.

      Bae, K.T., Park, S.-H., Moon, C.-H., Kim, J.-H., Kaya, D., Zhao, T., 2010. Dual-echo arteriovenography imaging with 7T MRI: CODEA with 7T. J. Magn. Reson. Imaging 31, 255–261. https://doi.org/10.1002/jmri.22019

      Bartholdi, E., Ernst, R.R., 1973. Fourier spectroscopy and the causality principle. Journal of Magnetic Resonance (1969) 11, 9–19. https://doi.org/10.1016/0022-2364(73)90076-0

      Bernier, M., Cunnane, S.C., Whittingstall, K., 2018. The morphology of the human cerebrovascular system. Human Brain Mapping 39, 4962–4975. https://doi.org/10.1002/hbm.24337

      Bouvy, W.H., Biessels, G.J., Kuijf, H.J., Kappelle, L.J., Luijten, P.R., Zwanenburg, J.J.M., 2014. Visualization of Perivascular Spaces and Perforating Arteries With 7 T Magnetic Resonance Imaging. Investigative Radiology 49, 307–313. https://doi.org/10.1097/RLI.0000000000000027

      Bouvy, W.H., Geurts, L.J., Kuijf, H.J., Luijten, P.R., Kappelle, L.J., Biessels, G.J., Zwanenburg, J.J.M., 2016. Assessment of blood flow velocity and pulsatility in cerebral perforating arteries with 7-T quantitative flow MRI: Blood Flow Velocity And Pulsatility In Cerebral Perforating Arteries. NMR Biomed. 29, 1295–1304. https://doi.org/10.1002/nbm.3306

      Brown, R.W., Cheng, Y.-C.N., Haacke, E.M., Thompson, M.R., Venkatesan, R., 2014a. Chapter 24 - MR Angiography and Flow Quantification, in: Magnetic Resonance Imaging. John Wiley & Sons, Ltd, pp. 701–737. https://doi.org/10.1002/9781118633953.ch24

      Brown, R.W., Cheng, Y.-C.N., Haacke, E.M., Thompson, M.R., Venkatesan, R., 2014b. Chapter 15 - Signal, Contrast, and Noise, in: Magnetic Resonance Imaging. John Wiley & Sons, Ltd, pp. 325–373. https://doi.org/10.1002/9781118633953.ch15

      Carr, J.C., Carroll, T.J., 2012. Magnetic resonance angiography: principles and applications. Springer, New York.

      Cassot, F., Lauwers, F., Fouard, C., Prohaska, S., Lauwers-Cances, V., 2006. A Novel Three-Dimensional Computer-Assisted Method for a Quantitative Study of Microvascular Networks of the Human Cerebral Cortex. Microcirculation 13, 1–18. https://doi.org/10.1080/10739680500383407

      Chen, L., Mossa-Basha, M., Balu, N., Canton, G., Sun, J., Pimentel, K., Hatsukami, T.S., Hwang, J.-N., Yuan, C., 2018. Development of a quantitative intracranial vascular features extraction tool on 3DMRA using semiautomated open-curve active contour vessel tracing: Comprehensive Artery Features Extraction From 3D MRA. Magn. Reson. Med 79, 3229–3238. https://doi.org/10.1002/mrm.26961

      Choi, U.-S., Kawaguchi, H., Kida, I., 2020. Cerebral artery segmentation based on magnetization-prepared two rapid acquisition gradient echo multi-contrast images in 7 Tesla magnetic resonance imaging. NeuroImage 222, 117259. https://doi.org/10.1016/j.neuroimage.2020.117259

      Conolly, S., Nishimura, D., Macovski, A., Glover, G., 1988. Variable-rate selective excitation. Journal of Magnetic Resonance (1969) 78, 440–458. https://doi.org/10.1016/0022-2364(88)90131-X

      Deistung, A., Dittrich, E., Sedlacik, J., Rauscher, A., Reichenbach, J.R., 2009. ToF-SWI: Simultaneous time of flight and fully flow compensated susceptibility weighted imaging. J. Magn. Reson. Imaging 29, 1478–1484. https://doi.org/10.1002/jmri.21673

      Detre, J.A., Leigh, J.S., Williams, D.S., Koretsky, A.P., 1992. Perfusion imaging. Magnetic Resonance in Medicine 23, 37–45. https://doi.org/10.1002/mrm.1910230106

      Du, Y., Parker, D.L., Davis, W.L., Blatter, D.D., 1993. Contrast-to-Noise-Ratio Measurements in Three-Dimensional Magnetic Resonance Angiography. Investigative Radiology 28, 1004–1009.

      Du, Y.P., Jin, Z., 2008. Simultaneous acquisition of MR angiography and venography (MRAV). Magn. Reson. Med. 59, 954–958. https://doi.org/10.1002/mrm.21581

      Du, Y.P., Parker, D.L., Davis, W.L., Cao, G., 1994. Reduction of partial-volume artifacts with zero-filled interpolation in three-dimensional MR angiography. J. Magn. Reson. Imaging 4, 733–741. https://doi.org/10.1002/jmri.1880040517

      Du, Y.P., Parker, D.L., Davis, W.L., Cao, G., Buswell, H.R., Goodrich, K.C., 1996. Experimental and theoretical studies of vessel contrast-to-noise ratio in intracranial time-of-flight MR angiography. Journal of Magnetic Resonance Imaging 6, 99–108. https://doi.org/10.1002/jmri.1880060120

      Duvernoy, H., Delon, S., Vannson, J.L., 1983. The Vascularization of The Human Cerebellar Cortex. Brain Research Bulletin 11, 419–480.

      Duvernoy, H.M., Delon, S., Vannson, J.L., 1981. Cortical blood vessels of the human brain. Brain Research Bulletin 7, 519–579. https://doi.org/10.1016/0361-9230(81)90007-1

      Eckstein, K., Bachrata, B., Hangel, G., Widhalm, G., Enzinger, C., Barth, M., Trattnig, S., Robinson, S.D., 2021. Improved susceptibility weighted imaging at ultra-high field using bipolar multi-echo acquisition and optimized image processing: CLEAR-SWI. NeuroImage 237, 118175. https://doi.org/10.1016/j.neuroimage.2021.118175

      Edelstein, W.A., Glover, G.H., Hardy, C.J., Redington, R.W., 1986. The intrinsic signal-to-noise ratio in NMR imaging. Magn. Reson. Med. 3, 604–618. https://doi.org/10.1002/mrm.1910030413

      Fan, A.P., Govindarajan, S.T., Kinkel, R.P., Madigan, N.K., Nielsen, A.S., Benner, T., Tinelli, E., Rosen, B.R., Adalsteinsson, E., Mainero, C., 2015. Quantitative oxygen extraction fraction from 7-Tesla MRI phase: reproducibility and application in multiple sclerosis. J Cereb Blood Flow Metab 35, 131–139. https://doi.org/10.1038/jcbfm.2014.187

      Fiedler, T.M., Ladd, M.E., Bitz, A.K., 2018. SAR Simulations & Safety. NeuroImage 168, 33–58. https://doi.org/10.1016/j.neuroimage.2017.03.035

      Frässle, S., Aponte, E.A., Bollmann, S., Brodersen, K.H., Do, C.T., Harrison, O.K., Harrison, S.J., Heinzle, J., Iglesias, S., Kasper, L., Lomakina, E.I., Mathys, C., Müller-Schrader, M., Pereira, I., Petzschner, F.H., Raman, S., Schöbi, D., Toussaint, B., Weber, L.A., Yao, Y., Stephan, K.E., 2021. TAPAS: An Open-Source Software Package for Translational Neuromodeling and Computational Psychiatry. Front. Psychiatry 12. https://doi.org/10.3389/fpsyt.2021.680811

      Gulban, O.F., Bollmann, S., Huber, R., Wagstyl, K., Goebel, R., Poser, B.A., Kay, K., Ivanov, D., 2021. Mesoscopic Quantification of Cortical Architecture in the Living Human Brain. https://doi.org/10.1101/2021.11.25.470023

      Haacke, E.M., Masaryk, T.J., Wielopolski, P.A., Zypman, F.R., Tkach, J.A., Amartur, S., Mitchell, J., Clampitt, M., Paschal, C., 1990. Optimizing blood vessel contrast in fast three-dimensional MRI. Magn. Reson. Med. 14, 202–221. https://doi.org/10.1002/mrm.1910140207

      Helthuis, J.H.G., van Doormaal, T.P.C., Hillen, B., Bleys, R.L.A.W., Harteveld, A.A., Hendrikse, J., van der Toorn, A., Brozici, M., Zwanenburg, J.J.M., van der Zwan, A., 2019. Branching Pattern of the Cerebral Arterial Tree. Anat Rec 302, 1434–1446. https://doi.org/10.1002/ar.23994

      Heverhagen, J.T., Bourekas, E., Sammet, S., Knopp, M.V., Schmalbrock, P., 2008. Time-of-Flight Magnetic Resonance Angiography at 7 Tesla. Investigative Radiology 43, 568–573. https://doi.org/10.1097/RLI.0b013e31817e9b2c

      Hirsch, S., Reichold, J., Schneider, M., Székely, G., Weber, B., 2012. Topology and Hemodynamics of the Cortical Cerebrovascular System. J Cereb Blood Flow Metab 32, 952–967. https://doi.org/10.1038/jcbfm.2012.39

      Horn, B.K.P., Schunck, B.G., 1981. Determining optical flow. Artificial Intelligence 17, 185–203. https://doi.org/10.1016/0004-3702(81)90024-2

      Huck, J., Wanner, Y., Fan, A.P., Jäger, A.-T., Grahl, S., Schneider, U., Villringer, A., Steele, C.J., Tardif, C.L., Bazin, P.-L., Gauthier, C.J., 2019. High resolution atlas of the venous brain vasculature from 7 T quantitative susceptibility maps. Brain Struct Funct 224, 2467–2485. https://doi.org/10.1007/s00429-019-01919-4

      Johst, S., Wrede, K.H., Ladd, M.E., Maderwald, S., 2012. Time-of-Flight Magnetic Resonance Angiography at 7 T Using Venous Saturation Pulses With Reduced Flip Angles. Investigative Radiology 47, 445–450. https://doi.org/10.1097/RLI.0b013e31824ef21f

      Kang, C.-K., Park, C.-A., Kim, K.-N., Hong, S.-M., Park, C.-W., Kim, Y.-B., Cho, Z.-H., 2010. Non-invasive visualization of basilar artery perforators with 7T MR angiography. Journal of Magnetic Resonance Imaging 32, 544–550. https://doi.org/10.1002/jmri.22250

      Kasper, L., Engel, M., Barmet, C., Haeberlin, M., Wilm, B.J., Dietrich, B.E., Schmid, T., Gross, S., Brunner, D.O., Stephan, K.E., Pruessmann, K.P., 2018. Rapid anatomical brain imaging using spiral acquisition and an expanded signal model. NeuroImage 168, 88–100. https://doi.org/10.1016/j.neuroimage.2017.07.062

      Klepaczko, A., Szczypiński, P., Deistung, A., Reichenbach, J.R., Materka, A., 2016. Simulation of MR angiography imaging for validation of cerebral arteries segmentation algorithms. Computer Methods and Programs in Biomedicine 137, 293–309. https://doi.org/10.1016/j.cmpb.2016.09.020

      Kobari, M., Gotoh, F., Fukuuchi, Y., Tanaka, K., Suzuki, N., Uematsu, D., 1984. Blood Flow Velocity in the Pial Arteries of Cats, with Particular Reference to the Vessel Diameter. J Cereb Blood Flow Metab 4, 110–114. https://doi.org/10.1038/jcbfm.1984.15

      Ladd, M.E., 2007. High-Field-Strength Magnetic Resonance: Potential and Limits. Top Magn Reson Imaging 18, 139–152.

      Lesage, D., Angelini, E.D., Bloch, I., Funka-Lea, G., 2009. A review of 3D vessel lumen segmentation techniques: Models, features and extraction schemes. Medical Image Analysis 13, 819–845. https://doi.org/10.1016/j.media.2009.07.011

      Maderwald, S., Ladd, S.C., Gizewski, E.R., Kraff, O., Theysohn, J.M., Wicklow, K., Moenninghoff, C., Wanke, I., Ladd, M.E., Quick, H.H., 2008. To TOF or not to TOF: strategies for non-contrast-enhanced intracranial MRA at 7 T. Magn Reson Mater Phy 21, 159. https://doi.org/10.1007/s10334-007-0096-9

      Manjón, J.V., Coupé, P., Martí‐Bonmatí, L., Collins, D.L., Robles, M., 2010. Adaptive non-local means denoising of MR images with spatially varying noise levels. Journal of Magnetic Resonance Imaging 31, 192–203. https://doi.org/10.1002/jmri.22003

      Mansfield, P., Harvey, P.R., 1993. Limits to neural stimulation in echo-planar imaging. Magn. Reson. Med. 29, 746–758. https://doi.org/10.1002/mrm.1910290606

      Masaryk, T.J., Modic, M.T., Ross, J.S., Ruggieri, P.M., Laub, G.A., Lenz, G.W., Haacke, E.M., Selman, W.R., Wiznitzer, M., Harik, S.I., 1989. Intracranial circulation: preliminary clinical results with three-dimensional (volume) MR angiography. Radiology 171, 793–799. https://doi.org/10.1148/radiology.171.3.2717754

      Mattern, H., Sciarra, A., Godenschweger, F., Stucht, D., Lüsebrink, F., Rose, G., Speck, O., 2018. Prospective motion correction enables highest resolution time-of-flight angiography at 7T: Prospectively Motion-Corrected TOF Angiography at 7T. Magn. Reson. Med 80, 248–258. https://doi.org/10.1002/mrm.27033

      Mattern, H., Sciarra, A., Lüsebrink, F., Acosta‐Cabronero, J., Speck, O., 2019. Prospective motion correction improves high‐resolution quantitative susceptibility mapping at 7T. Magn. Reson. Med 81, 1605–1619. https://doi.org/10.1002/mrm.27509

      Mennes, M., Jenkinson, M., Valabregue, R., Buitelaar, J.K., Beckmann, C., Smith, S., 2014. Optimizing full-brain coverage in human brain MRI through population distributions of brain size. NeuroImage 98, 513–520. https://doi.org/10.1016/j.neuroimage.2014.04.030

      Moccia, S., De Momi, E., El Hadji, S., Mattos, L.S., 2018. Blood vessel segmentation algorithms — Review of methods, datasets and evaluation metrics. Computer Methods and Programs in Biomedicine 158, 71–91. https://doi.org/10.1016/j.cmpb.2018.02.001

      Mustafa, M.A.R., 2016. A data-driven learning approach to image registration.

      Mut, F., Wright, S., Ascoli, G.A., Cebral, J.R., 2014. Morphometric, geographic, and territorial characterization of brain arterial trees. International Journal for Numerical Methods in Biomedical Engineering 30, 755–766. https://doi.org/10.1002/cnm.2627

      Nagaoka, T., Yoshida, A., 2006. Noninvasive Evaluation of Wall Shear Stress on Retinal Microcirculation in Humans. Invest. Ophthalmol. Vis. Sci. 47, 1113. https://doi.org/10.1167/iovs.05-0218

      Nishimura, D.G., Irarrazabal, P., Meyer, C.H., 1995. A Velocity k-Space Analysis of Flow Effects in Echo-Planar and Spiral Imaging. Magnetic Resonance in Medicine 33, 549–556. https://doi.org/10.1002/mrm.1910330414

      Nishimura, D.G., Jackson, J.I., Pauly, J.M., 1991. On the nature and reduction of the displacement artifact in flow images. Magnetic Resonance in Medicine 22, 481–492. https://doi.org/10.1002/mrm.1910220255

      Nonaka, H., Akima, M., Hatori, T., Nagayama, T., Zhang, Z., Ihara, F., 2003. Microvasculature of the human cerebral white matter: Arteries of the deep white matter. Neuropathology 23, 111–118. https://doi.org/10.1046/j.1440-1789.2003.00486.x

      North, D.O., 1963. An Analysis of the factors which determine signal/noise discrimination in pulsed-carrier systems. Proceedings of the IEEE 51, 1016–1027. https://doi.org/10.1109/PROC.1963.2383

      Park, C.S., Hartung, G., Alaraj, A., Du, X., Charbel, F.T., Linninger, A.A., 2020. Quantification of blood flow patterns in the cerebral arterial circulation of individual (human) subjects. Int J Numer Meth Biomed Engng 36. https://doi.org/10.1002/cnm.3288

      Parker, D.L., Goodrich, K.C., Roberts, J.A., Chapman, B.E., Jeong, E.-K., Kim, S.-E., Tsuruda, J.S., Katzman, G.L., 2003. The need for phase-encoding flow compensation in high-resolution intracranial magnetic resonance angiography. J. Magn. Reson. Imaging 18, 121–127. https://doi.org/10.1002/jmri.10322

      Parker, D.L., Yuan, C., Blatter, D.D., 1991. MR angiography by multiple thin slab 3D acquisition. Magn. Reson. Med. 17, 434–451. https://doi.org/10.1002/mrm.1910170215

      Pauling, L., Coryell, C.D., 1936. The magnetic properties and structure of hemoglobin, oxyhemoglobin and carbonmonoxyhemoglobin. Proceedings of the National Academy of Sciences 22, 210–216. https://doi.org/10.1073/pnas.22.4.210

      Payne, S.J., 2017. Cerebral Blood Flow And Metabolism: A Quantitative Approach. World Scientific.

      Peters, A.M., Brookes, M.J., Hoogenraad, F.G., Gowland, P.A., Francis, S.T., Morris, P.G., Bowtell, R., 2007. T2* measurements in human brain at 1.5, 3 and 7 T. Magnetic Resonance Imaging 25, 748–753. https://doi.org/10.1016/j.mri.2007.02.014

      Pfeifer, R.A., 1930. Grundlegende Untersuchungen für die Angioarchitektonik des menschlichen Gehirns. Berlin: Julius Springer.

      Phellan, R., Forkert, N.D., 2017. Comparison of vessel enhancement algorithms applied to time-of-flight MRA images for cerebrovascular segmentation. Medical Physics 44, 5901–5915. https://doi.org/10.1002/mp.12560

      Pohmann, R., Speck, O., Scheffler, K., 2016. Signal-to-Noise Ratio and MR Tissue Parameters in Human Brain Imaging at 3, 7, and 9.4 Tesla Using Current Receive Coil Arrays. Magn. Reson. Med. 75, 801–809. https://doi.org/10.1002/mrm.25677

      Reichenbach, J.R., Venkatesan, R., Schillinger, D.J., Kido, D.K., Haacke, E.M., 1997. Small vessels in the human brain: MR venography with deoxyhemoglobin as an intrinsic contrast agent. Radiology 204, 272–277. https://doi.org/10.1148/radiology.204.1.9205259

      Schmid, F., Barrett, M.J.P., Jenny, P., Weber, B., 2019. Vascular density and distribution in neocortex. NeuroImage 197, 792–805. https://doi.org/10.1016/j.neuroimage.2017.06.046

      Schmitter, S., Bock, M., Johst, S., Auerbach, E.J., Uğurbil, K., Moortele, P.-F.V. de, 2012. Contrast enhancement in TOF cerebral angiography at 7 T using saturation and MT pulses under SAR constraints: Impact of VERSE and sparse pulses. Magnetic Resonance in Medicine 68, 188–197. https://doi.org/10.1002/mrm.23226

      Schulz, J., Boyacioglu, R., Norris, D.G., 2016. Multiband multislab 3D time-of-flight magnetic resonance angiography for reduced acquisition time and improved sensitivity. Magn Reson Med 75, 1662–8. https://doi.org/10.1002/mrm.25774

      Shu, C.Y., Sanganahalli, B.G., Coman, D., Herman, P., Hyder, F., 2016. New horizons in neurometabolic and neurovascular coupling from calibrated fMRI, in: Progress in Brain Research. Elsevier, pp. 99–122. https://doi.org/10.1016/bs.pbr.2016.02.003

      Stamm, A.C., Wright, C.L., Knopp, M.V., Schmalbrock, P., Heverhagen, J.T., 2013. Phase contrast and time-of-flight magnetic resonance angiography of the intracerebral arteries at 1.5, 3 and 7 T. Magnetic Resonance Imaging 31, 545–549. https://doi.org/10.1016/j.mri.2012.10.023

      Stewart, A.W., Robinson, S.D., O’Brien, K., Jin, J., Widhalm, G., Hangel, G., Walls, A., Goodwin, J., Eckstein, K., Tourell, M., Morgan, C., Narayanan, A., Barth, M., Bollmann, S., 2022. QSMxT: Robust masking and artifact reduction for quantitative susceptibility mapping. Magnetic Resonance in Medicine 87, 1289–1300. https://doi.org/10.1002/mrm.29048

      Stucht, D., Danishad, K.A., Schulze, P., Godenschweger, F., Zaitsev, M., Speck, O., 2015. Highest Resolution In Vivo Human Brain MRI Using Prospective Motion Correction. PLoS ONE 10, e0133921. https://doi.org/10.1371/journal.pone.0133921

      Szikla, G., Bouvier, G., Hori, T., Petrov, V., 1977. Angiography of the Human Brain Cortex. Springer Berlin Heidelberg, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-81145-6

      Triantafyllou, C., Polimeni, J.R., Wald, L.L., 2011. Physiological noise and signal-to-noise ratio in fMRI with multi-channel array coils. NeuroImage 55, 597–606. https://doi.org/10.1016/j.neuroimage.2010.11.084

      Tustison, N.J., Avants, B.B., Cook, P.A., Zheng, Y., Egan, A., Yushkevich, P.A., Gee, J.C., 2010. N4ITK: Improved N3 Bias Correction. IEEE Transactions on Medical Imaging 29, 1310–1320. https://doi.org/10.1109/TMI.2010.2046908

      Uludağ, K., Müller-Bierl, B., Uğurbil, K., 2009. An integrative model for neuronal activity-induced signal changes for gradient and spin echo functional imaging. NeuroImage 48, 150–165. https://doi.org/10.1016/j.neuroimage.2009.05.051

      Venkatesan, R., Haacke, E.M., 1997. Role of high resolution in magnetic resonance (MR) imaging: Applications to MR angiography, intracranial T1-weighted imaging, and image interpolation. International Journal of Imaging Systems and Technology 8, 529–543. https://doi.org/10.1002/(SICI)1098-1098(1997)8:6<529::AID-IMA5>3.0.CO;2-C

      von Morze, C., Xu, D., Purcell, D.D., Hess, C.P., Mukherjee, P., Saloner, D., Kelley, D.A.C., Vigneron, D.B., 2007. Intracranial time-of-flight MR angiography at 7T with comparison to 3T. J. Magn. Reson. Imaging 26, 900–904. https://doi.org/10.1002/jmri.21097

      Ward, P.G.D., Ferris, N.J., Raniga, P., Dowe, D.L., Ng, A.C.L., Barnes, D.G., Egan, G.F., 2018. Combining images and anatomical knowledge to improve automated vein segmentation in MRI. NeuroImage 165, 294–305. https://doi.org/10.1016/j.neuroimage.2017.10.049

      Wilms, G., Bosmans, H., Demaerel, Ph., Marchal, G., 2001. Magnetic resonance angiography of the intracranial vessels. European Journal of Radiology 38, 10–18. https://doi.org/10.1016/S0720-048X(01)00285-6

      Wright, S.N., Kochunov, P., Mut, F., Bergamino, M., Brown, K.M., Mazziotta, J.C., Toga, A.W., Cebral, J.R., Ascoli, G.A., 2013. Digital reconstruction and morphometric analysis of human brain arterial vasculature from magnetic resonance angiography. NeuroImage 82, 170–181. https://doi.org/10.1016/j.neuroimage.2013.05.089

      Yushkevich, P.A., Piven, J., Hazlett, H.C., Smith, R.G., Ho, S., Gee, J.C., Gerig, G., 2006. User-guided 3D active contour segmentation of anatomical structures: Significantly improved efficiency and reliability. NeuroImage 31, 1116–1128. https://doi.org/10.1016/j.neuroimage.2006.01.015

      Zhang, Z., Deng, X., Weng, D., An, J., Zuo, Z., Wang, B., Wei, N., Zhao, J., Xue, R., 2015. Segmented TOF at 7T MRI: Technique and clinical applications. Magnetic Resonance Imaging 33, 1043–1050. https://doi.org/10.1016/j.mri.2015.07.002

      Zhao, J.M., Clingman, C.S., Närväinen, M.J., Kauppinen, R.A., van Zijl, P.C.M., 2007. Oxygenation and hematocrit dependence of transverse relaxation rates of blood at 3T. Magn. Reson. Med. 58, 592–597. https://doi.org/10.1002/mrm.21342

      Zhu, X., Tomanek, B., Sharp, J., 2013. A pixel is an artifact: On the necessity of zero-filling in fourier imaging. Concepts Magn. Reson. 42A, 32–44. https://doi.org/10.1002/cmr.a.21256

    1. Author Response

      Reviewer #2 (Public Review):

      I have only one concern with the study. I am not fully convinced that the disruption of behavioral updating is specifically due to NA signaling within OFC. In the first two studies, they observed non-specific anatomical effect likely due to the ablation of fibers of passage through OFC. The DREADD experiment is claimed to allay this concern. However, the DCZ was injected systemically. This means that any collaterals of LC NA neurons outside OFC will also be suppressed. While the lack of effect with the mPFC projection is interesting, this does not preclude an effect mediated in other target regions. Overall, I believe that none of the experiments truly demonstrate a specific effect of NA in OFC. A few experimental options that could be considered are injection of DCZ directly in OFC, optogenetic inhibition of fibers in OFC, or pharmacological disruption of NA signaling in OFC.

      The other options are to measure the effect of the toxin ablations from experiments 1 and 2 not just in mPFC but in other regions. If the non-specific effect is truly only in mPFC outside of OFC, that would lead to more confidence that mPFC projection is the only other viable pathway mediating the effect.

      As requested, we have quantified the effect of the toxin ablations in neighbouring cortical regions known to be involved in goal-directed behaviour, namely the insular cortex (IC; e.g., Balleine & Dickinson, 2000; Parkes & Balleine, 2013), the medial orbitofrontal cortex (MO; e.g., Bradfield et al., 2015; Gourley et al., 2016) and the secondary motor cortex (M2; Gremel et al., 2016). Briefly, we found that injection of the saporin toxin into VO and LO (Experiment 1) led to a significant decrease in NA fiber density in all examined regions. Injection of 6-OHDA also produced a significant loss of NA fibers in MO and M2, but not in the insular cortex. These results are presented in Suppl. Figures 1 and 3 (pages 28 and 30), and the statistics are reported in the main text (pages 6 and 11).

      We have also added the following to our discussion, which addresses the likely reasons for the off-target depletions we observed and acknowledges the potential role of collateralizing LC neurons:

      Page 21, line starting 374: “The use of the saporin toxin led to a dramatic decrease of NA fiber density in all analysed cortical areas (Suppl Fig 1). This may be due to diffusion of the toxin from the injection site, the existence of collateral LC neurons and/or fibers passing through the ventral portion of the OFC but targeting other cortical areas (Cerpa et al., 2019). However, injection of 6-OHDA led to much less off-site NA depletion, suggesting that a large part of the previous observation is toxin-specific. Indeed, no significant loss of NA fibers was visible in the insular cortex, which has been previously implicated in goal-directed behaviour (Balleine & Dickinson, 2000; Parkes et al., 2013; 2015; 2017). We did nevertheless observe an off-site depletion in more proximal prefrontal areas (prelimbic and medial orbitofrontal cortices), albeit a more modest depletion than that observed using the saporin toxin. Several studies have described the projection pattern of LC cells. These studies, using various techniques, indicate that LC cells mainly target a single region, and that only a small proportion of LC neurons collateralize to minor targets (Plummer et al., 2020; Kebschull et al., 2016; Uematsu et al., 2017; Chandler et al., 2014). Therefore, although the OFC noradrenergic innervation is presumably specific (Chandler et al., 2013), we cannot rule out a possible collateralization of some neurons toward neighbouring prefrontal areas (PL and MO). We have previously discussed that the posterior ventral portion of the OFC is an entry point for LC fibers en passant, which ultimately target other prefrontal areas (Cerpa et al., 2019).

      To achieve a greater anatomical selectivity, we used a CAV-2 vector carrying the noradrenergic promoter PRS to target either the LC:A32 or the LC:OFC pathway (Hayat et al., 2020; Hirschberg et al., 2017). It has been shown that the CAV-2 vector can infect axons of passage; however, the vector does not spread more than 200 µm from the injection site (Schwarz et al., 2015). Therefore, when targeting the OFC, we injected anterior to the level at which the highest density of fibers of passage is expected (Cerpa et al., 2019), in order to minimize infection of such fibers and restrict inhibition to our pathway of interest.

      Overall, the current behavioural results are in line with our previous work showing that the ability to associate new outcomes to previously acquired actions is impaired following chemogenetic inhibition of the VO and LO (Parkes et al., 2018) or disconnection of the VO and LO from the submedius thalamic nucleus (Fresno et al 2019). These results point to a necessary role of the ventral and lateral parts of the OFC and its noradrenergic innervation for updating A-O associations. However, it is worth mentioning that different subregions of the OFC, both along the medio-lateral and antero-posterior axes of OFC, display clear functional heterogeneities (Dalton et al 2016, Izquierdo 2017, Panayi & Killcross, 2018, Bradfield et al 2018, Barreiros et al 2021). Therefore, while we have previously focused on the anatomical heterogeneity of the noradrenergic innervation in these prefrontal subregions (Cerpa et al 2019), a thorough characterization of its functional role in each of these subregions still needs to be addressed.”

      One last concern is that the lack of the effect due to disruption of the mPFC projection is not guaranteed to not be from experimental issues. If the authors have some evidence that the mPFC projection disruption produced some other behavioral effect, that would make the lack of effect in this case more convincing.

      Unfortunately, we do not provide evidence in the current paper that disrupting the LC:mPFC projection (now termed LC:A32 in the current study, following the recommendation of reviewer 1) produces some other behavioural effect. However, in an ongoing series of experiments using the same tools as the current study, we found that inhibiting the LC:A32, but not the LC:OFC, pathway impairs Pavlovian contingency degradation, as shown in the figure below. We therefore believe that the failure of LC:mPFC pathway inhibition to affect outcome identity reversal in the present study is not due to experimental issues. Please note that in the figure below mPFC is referred to as area 32 (A32), as requested by reviewer 1.

      Figure 1. A) Experimental timeline for the Pavlovian contingency degradation procedure. Prior to behavioural training, rats were injected with CAV2-PRS-hM4D-mCherry into either the vlOFC or area 32 (A32). Number of food port entries during the non-degraded CS and degraded CS for rats injected with vehicle and rats injected with DCZ during degradation training (B, D) and the test in extinction (C, E). Inhibition of the LC:vlOFC had no effect on Pavlovian contingency degradation, whereas inhibition of LC:A32 during degradation training rendered rats insensitive to the change in the causal relationship between the CS and the US.

      Reviewer #3 (Public Review):

      I would be curious about the authors' thoughts regarding the recent Duan ... Robbins Neuron paper (https://pubmed.ncbi.nlm.nih.gov/34171290/), in which marmosets displayed paradoxical responses to VLO inactivation and stimulation in contingency degradation tasks. Are there ways to reconcile these reports?

      We previously argued that the updating processes underlying changes in causal contingency versus outcome identity may be supported by different prefrontal regions (Cerpa et al., 2021, Behav Neurosci). Unfortunately, the tasks used in the current study do not allow us to test if our rats are sensitive to changes in the action-outcome contingency. In fact, the effect of inactivation (or overactivation) of the ventral and lateral regions of OFC on an instrumental contingency degradation task similar to that used in Duan et al (2022) has not yet been examined in rats.

      Indeed, while it is stated in Duan et al (2022) that rats with lesions of lateral OFC are insensitive to contingency degradation, none of the citations provided support this conclusion (Balleine & Dickinson, 1998; Corbit & Balleine, 2003; Ostlund and Balleine, 2007; Yin et al., 2005). Balleine and Dickinson (1998) assessed the effect of prelimbic and insular cortex lesions (insular anteroposterior coordinate +1.2), with only the former affecting instrumental contingency degradation. Ostlund and Balleine (2007) assessed the effect of orbitofrontal lesions on Pavlovian contingency degradation (degradation of the S-O contingency), not on instrumental contingency degradation. Finally, Corbit and Balleine (2003) and Yin et al (2005) assessed the effect of prelimbic and dorsomedial striatum lesions, respectively. Nevertheless, there are some reports on the effect of chemogenetic inhibition of VO/LO on degradation in a nose-poke response task, but the results are conflicting (e.g., Whyte et al., 2019; Zimmerman et al., 2017; 2018). It would be very interesting to study the impact of both inactivation and overactivation of VO and LO in rats, using comparable tasks, to compare with the results found in marmosets.

      We have added the following to our discussion, which cites Duan et al (2022) and the need to better understand the role of VO and LO in contingency degradation.

      Page 24, line starting 450: “However, it is not yet clear if the NA-OFC system is also involved in detecting the causal relationship between an action and its outcome (see Cerpa et al., 2021 for a discussion). Some have reported impaired adaptation to contingency changes following inhibition of VO and LO or BDNF-knockdown in these regions (Whyte et al., 2019; Zimmerman et al., 2017), while another study shows that inhibition of VO/LO leaves sensitivity to degradation intact, at least during an initial test (Zimmerman et al., 2018). Interestingly, a recent paper in marmosets demonstrates that inactivation of anterior OFC (area 11) improves instrumental contingency degradation, whereas overactivation impairs degradation (Duan et al., 2022). The potential role of the rodent ventral and lateral regions of OFC, and the NA innervation of OFC, in adapting to degradation of instrumental contingencies requires further investigation.”

    1. Author Response:

      Reviewer #1 (Public Review):

      There is growing precedent for the utility of GWAS-type analyses in elucidating otherwise cryptic genotypic associations with specific Mtb phenotypes, most commonly drug resistance. This study represents the latest instalment of this type of approach, utilizing a large set of WGS data from clinical Mtb isolates and refining the search for DR-associated alleles by restricting the set to those predicted (or known) to be phenotypically DR. This revealed a number of potential candidate mutations, including some in nucleotide excision repair (uvrA, uvrB), base excision repair (mutY), and homologous recombination (recF). In validating these leads in functional assays, the authors present evidence supporting the impact of the identified mutations on antibiotic susceptibility in vitro and in macrophage and animal infection models. These results extend the number of candidate mutations associated with Mtb drug resistance; however, the following must be considered:

      (i) The GWAS analysis is the basis of this study, yet the description of the approach used and the presentation of the results obtained are occasionally obscure; for example, the authors report the use of known drug resistance phenotypes (where available) or inferences of drug resistance from genotypic data to enhance the potential to identify other mutations that might be implicated in enabling the DR mutations, yet their list of known DR mutations seems to consist predominantly of rare or unusual mutations, not those commonly associated with clinical DR-TB. In addition, the distribution of the identified resistance-associated mutations across the different lineages needs to be explained more clearly.

      In the revised manuscript, we have performed a phylogenetic analysis of the strains used. A phylogenetic tree was generated using Mycobacterium canetti as an outgroup (Figure 1b). The phylogeny clusters the strains into lineages 1, 2, 3, and 4: lineages 2, 3, and 4 cluster together, and lineage 1 is monophyletic, as reported previously. The genome sequence data of 2773 clinical strains were downloaded from NCBI. These strains were also part of the GWAS analyses performed by Coll et al. (https://pubmed.ncbi.nlm.nih.gov/29358649/) and Manson et al. (https://pubmed.ncbi.nlm.nih.gov/28092681/). The phenotypes of the strains used for the association analysis were reported in these previous studies; we have not performed other predictions. The supplementary files provide the lineage origin of each strain used in the study (Supplementary File 1 & 2). The distribution of resistance-associated mutations in different strains is shown in Figure 2-figure supplement 6a-h. As suggested, we have also performed an analysis in which we looked for direct target mutations in the strains that harbor mutations in the DNA repair genes (Figure 2-figure supplement 6i-k).

      We identified mostly rare mutations for the following reasons:

      1. For association mapping, we looked for mutations that were present only in the multidrug-resistant strains and not in the susceptible strains. This strategy therefore yielded mainly variants associated with the multidrug-resistant phenotype.

      2. We used a mixed linear model (MLM) for the association analysis. MLM controls for population-specific SNPs through PCA and kinship corrections. The false discovery rate (FDR)-adjusted p-values in the GAPIT software are stringent, as they correct the effect of each marker for population structure (Q) as well as kinship (K). Therefore, the probability of identifying a false-positive SNP is very low. We combined this with Bonferroni correction to identify markers associated with the drug-resistant phenotype.
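      To make the difference in stringency between the two corrections concrete, the following minimal Python sketch applies a Bonferroni threshold (alpha/m) and a Benjamini-Hochberg FDR adjustment to the same set of p-values. It is an illustration under stated assumptions only: the p-values are hypothetical, the function names are ours, and it does not reproduce GAPIT's Q+K mixed-model fitting or its exact FDR procedure.

```python
from typing import List


def bonferroni_hits(pvalues: List[float], alpha: float = 0.05) -> List[int]:
    """Indices of markers whose p-value passes the Bonferroni threshold alpha / m."""
    threshold = alpha / len(pvalues)
    return [i for i, p in enumerate(pvalues) if p <= threshold]


def bh_adjust(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical p-values for five markers (not data from this study).
    p = [1e-6, 0.004, 0.012, 0.03, 0.5]
    print("Bonferroni hits:", bonferroni_hits(p))
    print("BH-adjusted:", [round(q, 4) for q in bh_adjust(p)])
```

With these hypothetical values, Bonferroni (alpha/m = 0.01) retains two markers, whereas four BH-adjusted values fall at or below 0.05, which is why requiring both filters gives the more conservative marker set.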

      (ii) By combining target gene deletions with different complementation alleles, the authors provide compelling microbiological evidence supporting the inferred role of the mutY and uvrB mutations in enhanced survival under antibiotic treatment. The experimental work, however, is limited to assessments of competitive survival in various models, with/without antibiotic selection, or to mutant frequency analyses; there is no direct evidence provided in support of the proposed mechanism.

      To ascertain whether the better survival of RvΔmutY or RvΔmutY::mutY-R262Q is indeed due to the acquisition of mutations in the direct targets of antibiotics, we performed WGS of the strains from the ex vivo evolution experiment (Figure 5). Genomic DNA extracted from ten independent colonies (grown in vitro) was mixed in equal proportions before library preparation. Only those SNPs present in >20% of reads were retained for the analysis. Analysis of Rv sequences grown in vitro suggested that the laboratory strain has accumulated 100 SNPs compared with the reference strain, so the sequence of the Rv laboratory strain was used as the reference for the subsequent analysis. WGS data for the RvΔmutY, RvΔmutY::mutY, and RvΔmutY::mutY-R262Q strains grown in vitro did not show any mutation in the antibiotic target genes. In a similar vein, ten independent colonies each were selected for WGS from the 7H11-OADC plates after the final round of ex vivo selection in the presence or absence of antibiotics. The data indicated that in the absence of antibiotics, no direct target mutations were identified in the ex vivo passaged strains (Figure 6a & e). In the presence of isoniazid, we found katG mutations (Ser315Thr or Ser315Ile) in Rv and RvΔmutY but not in RvΔmutY::mutY or RvΔmutY::mutY-R262Q (Figure 6b & e). These findings are congruent with the ex vivo evolution CFU analysis, in which we did not observe a significant increase in the survival of RvΔmutY and RvΔmutY::mutY-R262Q in the presence of isoniazid (Figure 5). In the presence of ciprofloxacin and rifampicin, direct target mutations were identified in gyrA and rpoB (Figure 6c-e): Asp94Glu/Asp94Gly mutations were identified in gyrA, and His445Tyr/Ser450Leu mutations in rpoB, of RvΔmutY and RvΔmutY::mutY-R262Q, respectively. No direct target mutations were identified in Rv and RvΔmutY::mutY, suggesting that perturbed DNA repair aids the acquisition of drug resistance-conferring mutations in Mtb (Figure 6c-e & Supplementary File 8).
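      The >20% read threshold can be read against the pooling design. Assuming the ten colonies contribute equal amounts of DNA and each carries a single haploid genome, a variant present in k of the ten colonies is expected at about k × 10% of reads, so a strict >20% cutoff favours variants shared by three or more colonies, with sequencing noise blurring that boundary. The sketch below illustrates this; the VariantCall record, its field names, and the read counts in the usage are hypothetical and do not correspond to any particular variant caller's output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VariantCall:
    """Hypothetical minimal record for one SNP call in a pooled library."""
    gene: str
    alt_reads: int
    total_reads: int

    @property
    def alt_fraction(self) -> float:
        # Fraction of reads supporting the alternate allele (0 if no coverage).
        return self.alt_reads / self.total_reads if self.total_reads else 0.0


def expected_fraction(colonies_with_variant: int, colonies_pooled: int = 10) -> float:
    """Expected alt-read fraction if colonies are pooled in equal DNA proportions."""
    return colonies_with_variant / colonies_pooled


def retain_calls(calls: List[VariantCall], min_fraction: float = 0.20) -> List[VariantCall]:
    """Keep calls whose alt-read fraction strictly exceeds min_fraction."""
    return [c for c in calls if c.alt_fraction > min_fraction]


if __name__ == "__main__":
    # Hypothetical read counts for three calls in one pooled library.
    calls = [
        VariantCall("geneA", alt_reads=45, total_reads=100),
        VariantCall("geneB", alt_reads=12, total_reads=100),
        VariantCall("geneC", alt_reads=20, total_reads=100),
    ]
    for k in (1, 2, 3):
        print(k, "colonies ->", expected_fraction(k))
    print([c.gene for c in retain_calls(calls)])
```

Because the cutoff is strict, a call sitting exactly at 20% (geneC here) is discarded, which matches the wording "present in >20% of reads".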

      To determine whether the better survival of RvΔmutY or RvΔmutY::mutY-R262Q in the guinea pig infection experiment (Figure 8) is due to the accumulation of mutations in the host, we performed WGS of the strains isolated from guinea pig lungs. The analysis revealed that specific genes, such as cobQ1, smc, espI, and valS, were mutated only in RvΔmutY and RvΔmutY::mutY-R262Q but not in Rv or RvΔmutY::mutY. In addition, tcrA and gatA were mutated only in RvΔmutY, whereas rv0746 was mutated exclusively in RvΔmutY::mutY (Figure 8-figure supplement 2). However, we did not observe any direct target mutations, possibly because the guinea pigs were not subjected to antibiotic treatment. These data suggest that continued long-term selection pressure is necessary for the bacilli to acquire such mutations.

      (iii) The low drug concentrations used (especially of rifampicin against M. smegmatis) suggest the identified mutations confer low-level resistance to multiple antimycobacterial agents, in turn implying tolerance rather than resistance. If correct, it would be interesting to know how broadly tolerant strains containing these mutations are; that is, whether susceptibility is decreased to a broad range of antibiotics with different mechanisms of action (including both cidal and static agents), and whether the extent of the decrease can be determined quantitatively (for example, as a change in MIC value).

      To evaluate the effect of different drugs on the survival of RvΔmutY or RvΔmutY::mutY-R262Q, we performed killing kinetics in the presence and absence of isoniazid, rifampicin, ciprofloxacin, and ethambutol (Figure 4a). In the absence of antibiotics, the growth kinetics of Rv, RvΔmutY, RvΔmutY::mutY, and RvΔmutY::mutY-R262Q were similar (Figure 4b). In the presence of isoniazid, a ~2 log-fold decrease in bacterial survival was observed on day 3 in Rv and RvΔmutY::mutY, whereas in RvΔmutY and RvΔmutY::mutY-R262Q the decrease was limited to ~1.5 log-fold (Figure 4c). A similar trend was apparent on days 6 and 9, suggesting a ~5-fold increase in the survival of RvΔmutY and RvΔmutY::mutY-R262Q compared with Rv and RvΔmutY::mutY (Figure 4c). Interestingly, in the presence of ethambutol, we did not observe any significant difference (Figure 4d). In the presence of rifampicin and ciprofloxacin, we observed a ~10-fold increase in the survival of RvΔmutY and RvΔmutY::mutY-R262Q compared with Rv and RvΔmutY::mutY (Figure 4e-f). These results suggest that the absence of mutY, or the presence of the mutY variant, helps the bacilli withstand antibiotic stress.
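      Because survival is reported as log10 reductions in CFU, the fold difference in surviving bacteria between two strains is 10 raised to the difference between their log10 reductions, assuming both strains start from the same inoculum. The minimal sketch below performs that conversion; the 2.0 and 1.5 inputs in the usage simply reuse the approximate day-3 isoniazid values above as an arithmetic example, and the function names are hypothetical.

```python
import math


def fold_difference(log10_reduction_a: float, log10_reduction_b: float) -> float:
    """Fold more surviving CFU in strain B than in strain A.

    Assumes both strains start from the same inoculum, so the gap between
    their log10 reductions is the log10 of the survival ratio B / A.
    """
    return 10 ** (log10_reduction_a - log10_reduction_b)


def log10_gap_for_fold(fold: float) -> float:
    """Gap in log10 reduction that corresponds to a given fold difference."""
    return math.log10(fold)


if __name__ == "__main__":
    print(f"{fold_difference(2.0, 1.5):.2f}-fold (day-3 isoniazid example)")
    print(f"{log10_gap_for_fold(5):.2f} log10 units for a 5-fold difference")
    print(f"{log10_gap_for_fold(10):.2f} log10 units for a 10-fold difference")
```

For example, a 2.0 versus 1.5 log10 reduction corresponds to roughly a 3.2-fold difference in survivors, while ~5-fold and ~10-fold differences correspond to gaps of about 0.7 and 1.0 log10 units.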

      Reviewer #2 (Public Review):

      This interesting manuscript uses a collection of whole genome sequences of TB isolates to associate specific sequence polymorphisms with MDR/XDR strains and, having found certain mutations in DNA repair pathways, performs a detailed analysis of several mutations. The evaluation of the MutY polymorphism reveals that it is a loss-of-function allele and that TB strains carrying this mutation have a higher mutation frequency and enhanced survival in serial passage in macrophages. The strengths of the manuscript are the leveraging of a large sequence dataset to derive interesting candidate mutations in DNA repair pathways and the demonstration that at least one of these mutations has a detectable effect on mutagenicity and pathogenesis. The weaknesses of the manuscript are a lack of experimental exploration of the mechanism by which loss of a DNA repair pathway would enhance survival in vivo. The model presented is that these phenotypes are due to hypermutagenicity and thereby evolution of enhanced pathogenesis, but this is not actually directly tested or investigated. There are also some technical concerns for some of the experimental data, which could be strengthened.

      This paper presents the following data:

      • Analyzed whole-genome sequences of 2773 clinical strains: 160,000 SNPs identified
      • 1815 drug-susceptible/422 MDR/XDR strains: 188 mutations correlated with drug resistance.
      • Novel mutations associated with drug resistance have been found in base excision repair (BER), nucleotide excision repair (NER), and homologous recombination (HR) pathway genes (mutY, uvrA, uvrB, and recF).
      • Specific mutations mutY-R262Q and uvrB-A524V were studied.
      • mutY-R262Q and uvrB-A524V mutations behave as loss-of-function alleles in vivo, as measured by non-complementation of the increased mutation frequency measured by resistance to Rif and INH.
      • The mutY deletion and the mutY-R262Q mutation increase Mtb survival over WT in macrophages when Mtb has not been subjected to previous rounds of macrophage infection.
      • This advantage is exacerbated in the presence of antibiotics (Rif and Cipro but not INH).
      • The MutY deletion and the MutY-R262Q mutation result in an enhanced survival of Mtb during guinea pig infection.

      Major issues:

      The finding that mutations in MutY confer an advantage during macrophage infection is convincing based on the macrophage experiments, but it is premature to conclude that this effect is due to hypermutagenesis and selection of fitter bacterial clones. It has been described in E. coli (Foti et al., 2012) and recently in mycobacteria (Dupuy et al., 2020) that the MutY/MutM excision pathways can increase the lethality of antibiotic treatment because of double-strand breaks caused by adenine/oxoG excisions. The higher survival of the mutY mutant during antibiotic treatment could instead be due to reduced adenine/oxoG excision in the mutant rather than to the acquisition of advantageous mutations, or to some other mechanism. The same hypothesis cannot be excluded for the guinea pig experiments (no antibiotics, but oxidative stress mediated by host defenses could also increase oxoG) and should at least be discussed. An experiment that would support the idea that the in vivo advantage is due to hypermutagenesis would be whole genome sequencing of the output vs input populations to directly document increased mutagenesis. Similarly, is the ΔmutY survival advantage after rounds of macrophage infection dependent on the macrophage environment? What happens if the ΔmutY strain is cultivated in vitro in 7H9 (for the same number of generations) before infecting macrophages?

      We thank the reviewer for the insightful comments. To ascertain whether the better survival of RvΔmutY or RvΔmutY::mutY-R262Q is indeed due to the acquisition of mutations in the direct targets of antibiotics, we performed WGS of the strains from the ex vivo evolution experiment (Figure 5). Genomic DNA extracted from ten independent colonies (grown in vitro) was mixed in equal proportions prior to library preparation. For the analysis, only those SNPs that were present in >20% of reads were retained. Analysis of Rv sequences grown in vitro suggested that the laboratory strain has accumulated 100 SNPs compared with the reference strain, so the sequence of the Rv laboratory strain was used as the reference for the subsequent analysis. WGS data for the RvΔmutY, RvΔmutY::mutY, and RvΔmutY::mutY-R262Q strains grown in vitro did not show any mutation in the antibiotic target genes. In a similar vein, ten independent colonies each were selected for WGS from the 7H11-OADC plates after the final round of ex vivo selection in the presence or absence of antibiotics. The data indicated that in the absence of antibiotics, no direct target mutations were identified in the ex vivo passaged strains (Figure 6a & e). In the presence of isoniazid, we found katG mutations (Ser315Thr or Ser315Ile) in Rv and RvΔmutY but not in RvΔmutY::mutY or RvΔmutY::mutY-R262Q (Figure 6b & e). These findings are congruent with the ex vivo evolution CFU analysis, in which we did not observe a significant increase in the survival of RvΔmutY and RvΔmutY::mutY-R262Q in the presence of isoniazid (Figure 5). In the presence of ciprofloxacin and rifampicin, direct target mutations were identified in gyrA and rpoB (Figure 6c-e): Asp94Glu/Asp94Gly mutations were identified in gyrA, and His445Tyr/Ser450Leu mutations in rpoB, of RvΔmutY and RvΔmutY::mutY-R262Q, respectively. No direct target mutations were identified in Rv and RvΔmutY::mutY, suggesting that perturbed DNA repair aids the acquisition of drug resistance-conferring mutations in Mtb (Figure 6c-e & Supplementary File 8).

      To determine whether the better survival of RvΔmutY or RvΔmutY::mutY-R262Q in the guinea pig infection experiment (Figure 8) is due to the accumulation of mutations in the host, we performed WGS of the strains isolated from guinea pig lungs. The analysis revealed that specific genes, such as cobQ1, smc, espI, and valS, were mutated only in RvΔmutY and RvΔmutY::mutY-R262Q but not in Rv or RvΔmutY::mutY. In addition, tcrA and gatA were mutated only in RvΔmutY, whereas rv0746 was mutated exclusively in RvΔmutY::mutY (Figure 8-figure supplement 2). However, we did not observe any direct target mutations, possibly because the guinea pigs were not subjected to antibiotic treatment. These data suggest that continued long-term selection pressure is necessary for the bacilli to acquire such mutations.

      • It would be useful to present more data about the strain relatedness and genome characteristics of the DNA repair mutant strains in the GWAS. For example, the model would suggest that strains carrying DNA repair mutations should have a higher SNP load than control strains. Additionally, it would be helpful to know whether the identified DNA repair pathway mutations are from epidemiologically linked strains in the collection, to deduce whether these events are arising repeatedly or are a founder effect of a single mutant, since for each mutation the number of strains is small.

      We analyzed the genomes of the clinical strains that possess DNA repair gene mutations to determine additional polymorphisms. The number of SNPs in the strains harboring DNA repair mutations and in the drug-susceptible strains appears to be similar; any marginal difference was not statistically significant.

      We agree with the reviewer that these strains might be epidemiologically linked. In the present study, all the strains harboring a mutation in mutY belong to lineage 4. We observed that all the mutY mutation-containing strains were either MDR or pre-XDR, in contrast to the drug-susceptible strains of the same clade.

      • Some of the mutation frequency, survival, and competition data could be strengthened by more experimental replicates. Data in lines 370-372 (mutation frequency), lines 387-388 (survival of strains ex vivo), and line 394 (competition experiment): "Two biologically independent experiments were performed. Each experiment was performed in technical triplicates. Data represent one of the two biological experiments." Two biological replicates are insufficient for the phenotypes presented, and all replicates should be included in the analysis. In addition, the definition of "technical triplicates" should be given: does this mean the same culture sampled in triplicate?

      We thank the reviewer for the comment. We performed at least two independent experiments, each with biological triplicates (not technical triplicates); we apologize for the error in the original description. We have reported data from one independent experiment consisting of at least biological triplicates. For the mutation rate analysis, we performed the experiment using six independent colonies. These points are stated in the Methods and figure legends of the revised manuscript.

      • MutY phenotypes. One caveat to the conclusion that the MutY R262Q mutant is nonfunctional is the lack of examination of the expression of the complementing protein. It would be informative to comment on the location of this mutation in relation to the known structures of MutY proteins. Similarly, for the UvrB polymorphism, the null strain has a clear UV sensitivity phenotype in the literature, so a fuller interrogation of UV killing would be informative regarding the A524V mutation.

      We have now included western blot data for both complementation strains (Figure 3-figure supplement 1). We agree with the reviewer that the uvrB null mutant may have a UV sensitivity phenotype, but we have not performed this experiment in the present study.

      Reviewer #3 (Public Review):

      STRENGTHS

      • This ambitious study is broad in scope, beginning with a bacterial GWAS study and extending all the way to in vivo guinea pig infection models.

      • Numerous reports have attempted to identify Mtb strains with elevated mutation rates, and the results are conflicting. The present study sets out to thoroughly evaluate one such mutation that may produce a mutator phenotype, mutY-Arg262Gln.

      WEAKNESSES

      • While the authors' follow-up experiments with the mutY-Arg262Gln allele are all consistent with the conclusion that this mutation elevates the mutation rate in Mtb and thus could promote the evolution of drug resistance, further work is needed to unambiguously demonstrate this link.

      • The authors highlight five mutations in genes associated with DNA replication and/or repair from their GWAS analysis:

      o dnaA-Arg233Gln: as the authors note in the Discussion, Hicks et al. associate SNPs in dnaA with low-level isoniazid resistance, as a result of lowered katG expression. Since this is unrelated to their focus on DNA repair genes whose mutation could elevate mutation rates, I would consider removing this allele from the Table.

      As suggested, we have removed dnaA from Table 3.

      o mutY-Arg262Gln: querying publicly available whole genome sequences of clinical Mtb isolates, this SNP appears to be restricted to lineage 4.3 (L4.3). All of these L4.3 strains appear to be drug-resistant. How many times did the mutY-Arg262Gln mutation evolve in the authors' dataset? If there is evidence of homoplastic evolution, this would strengthen their case. If not, it doesn't mean the authors' findings are incorrect, but it does elevate the risk that this mutation could be a passenger (i.e. not driver) mutation. To address this, the authors could attempt to date when mutY-Arg262Gln arose. If it arose before the evolution of the drug-resistance conferring alleles in these L4.3 strains, that is consistent with (but not proof of) a driver mutation. If mutY-Arg262Gln arose after, this is much more consistent with a passenger mutation.

      As pointed out by the reviewer, the mutY-Arg262Gln mutation is restricted to lineage 4. We checked the mutY gene sequence in the strains harboring the mutY-Arg262Gln mutation and in susceptible strains of the same clade. We identified only the reported mutation in the drug-resistant strains, and there was no synonymous mutation that could be used for molecular clock analysis. To ascertain whether it is a passenger or a driver mutation, we performed multiple experiments, which suggest that the identified mutation aids in the acquisition of drug resistance.

      o uvrB-Ala524Val: curiously we don't see this SNP in our dataset of publicly available whole genome sequences of clinical Mtb isolates (~45,000 genomes).

      We have rechecked this SNP in our dataset. This SNP was present in 87 drug-resistant strains that belong to lineage 2.

      o uvrA-Gln135Lys: this SNP also appears to be restricted to lineage 4.3. Same question as for mutY-Arg262Gln.

      As pointed out by the reviewer, the uvrA-Gln135Lys mutation is restricted to lineage 4. We identified only the reported mutation in the drug-resistant strains, and there was no synonymous mutation that could be used for molecular clock analysis.

      o recF-Gly269Gly: this is a very common mutation, is it unique to lineage 2.2.1? Same question as for mutY-Arg262Gln.

      The recF-Gly269Gly mutation was present in the lineage 2 strains. Here also, we identified only the reported mutation in the drug-resistant strains, and there was no synonymous mutation that could be used for molecular clock analysis.

      • The CRYPTIC consortium recently published a number of preprints on biorxiv detailing very large GWAS studies in Mtb. Did any of these reports also associate drug resistance with mutY? If yes, this should be stated. If not, the potential reasons for this discrepancy should be discussed.

      We have checked the recently published CRYPTIC consortium article (https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001721#sec012) for mutY-Arg262Gln. We did not find the mutY-Arg262Gln mutation in their analysis; this is due to the different strains used in their study. However, we identified the recF-Gly269Gly mutation in their dataset.

      • Based on the authors follow-up studies in vivo, MutY-Arg262Gln is presumed to be a loss-of-function allele. If the authors could convincingly demonstrate this biochemically with recombinant proteins, this would significantly strengthen their case.

      Experiments performed in Msm and Mtb mutant strains suggest that the MutY variant is a loss-of-function allele. We have not performed in vitro assays to confirm this.

      • If the authors are correct and mutY-Arg262Gln strains have elevated mutation rates, presumably there would be evidence of this in the clinical strain sequencing data. Do mutY-Arg262Gln containing strains have elevated C→G or C→A mutations in their genomes? Presumably such strains would also have a higher number of SNPs than closely related strains WT for mutY- is this the case?

      We analyzed the genomes of the clinical strains that possess DNA repair gene mutations to determine additional polymorphisms. The number of SNPs in the strains harboring DNA repair mutations appears to be higher than in the drug-susceptible strains. We also looked for C→T and C→G mutations in the same strains. C→T mutations are more frequent in the strains harboring the mutY variant than in the susceptible strains (Figure 2-figure supplement 6l). However, we could not perform a statistical analysis, as the number of strains that harbor the mutY variant is limited to eight. Thus, the data suggest that, empirically, the strains harboring the mutY variant show more SNPs elsewhere in the genome and more C→T mutations. We do not state these conclusions strongly in the manuscript, as the difference has not been shown to be statistically significant.
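      For readers less familiar with mutation-spectrum analysis, a common convention is to collapse each single-base substitution onto the pyrimidine of the mutated base pair, giving six classes (C>A, C>G, C>T, T>A, T>C, T>G); for example, a G>A change on the reference strand is counted as C>T. The minimal sketch below applies that convention to (reference, alternate) base pairs. It is an illustrative assumption about how such counts might be tabulated, not the pipeline used in the study, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

# Watson-Crick complements for the four DNA bases.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def spectrum_class(ref: str, alt: str) -> str:
    """Collapse a single-base substitution onto its pyrimidine-reference class."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    if ref in ("A", "G"):
        # Report the change on the complementary strand so the reference is C or T.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


def spectrum(snps: Iterable[Tuple[str, str]]) -> Counter:
    """Tally the six substitution classes for (ref, alt) pairs from one genome."""
    return Counter(spectrum_class(ref, alt) for ref, alt in snps)


if __name__ == "__main__":
    # Hypothetical (ref, alt) pairs, not data from this study.
    example = [("C", "T"), ("G", "A"), ("G", "T"), ("T", "C")]
    for cls, n in sorted(spectrum(example).items()):
        print(cls, n)
```

Per-genome tallies of this kind could then be divided by each strain's total SNP count, so that strains with different overall SNP loads are compared on proportions rather than raw counts.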

      • Although more laborious, mutation rates measured by Luria-Delbruck fluctuation analysis are more accurate than mutation frequencies. I would recommend repeating key experiments using Luria-Delbruck fluctuation analysis. It is also important to report both drug-resistant colony counts and total CFU in these sorts of experiments. Given the clumpy nature of mycobacteria, mutation rates can appear to be artificially elevated due to low total CFU rather than to an increase in the number of drug-resistant colonies.

      As suggested, we determined the mutation rate in the presence of isoniazid, rifampicin, and ciprofloxacin. Relative to Rv, the fold increases in mutation rate for RvΔmutY, RvΔmutY::mutY, and RvΔmutY::mutY-R262Q were 2.90, 0.76, and 3.0 with isoniazid; 5.62, 1.13, and 5.10 with rifampicin; and 9.14, 1.57, and 8.71 with ciprofloxacin, respectively (Figure 3g-j).
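      The response does not state which fluctuation-analysis estimator was used. As one possible illustration, the sketch below estimates m, the expected number of mutations per culture, either by the Lea-Coulson method of the median (solving r/m - ln(m) = 1.24, where r is the median mutant count across parallel cultures) or by the p0 method (m = -ln(p0)), and converts m into an approximate rate by dividing by the final number of cells per culture, Nt, measured on drug-free plates. A fold increase relative to Rv is then the ratio of two such rates. Maximum-likelihood estimators (for example, the Ma-Sandri-Sarkar method) are generally preferred in practice; the function names and inputs here are hypothetical.

```python
import math
import statistics
from typing import Sequence


def lea_coulson_m(mutant_counts: Sequence[int]) -> float:
    """Estimate m (mutations per culture) by the Lea-Coulson method of the median.

    Solves r/m - ln(m) - 1.24 = 0, where r is the median mutant count.
    The left-hand side is strictly decreasing in m, so bisection is safe.
    """
    r = statistics.median(mutant_counts)

    def f(m: float) -> float:
        return r / m - math.log(m) - 1.24

    lo, hi = 1e-9, 1.0
    while f(hi) > 0:  # widen the bracket until the sign changes
        hi *= 2.0
    for _ in range(200):  # bisection on [lo, hi]
        mid = (lo + hi) / 2.0
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def p0_m(mutant_counts: Sequence[int]) -> float:
    """Estimate m from the fraction of cultures with no mutants: m = -ln(p0)."""
    p0 = sum(1 for c in mutant_counts if c == 0) / len(mutant_counts)
    if not 0.0 < p0 < 1.0:
        raise ValueError("p0 method needs some, but not all, cultures without mutants")
    return -math.log(p0)


def mutation_rate(m: float, final_cells_per_culture: float) -> float:
    """Approximate mutation rate per cell per generation as m / Nt."""
    return m / final_cells_per_culture


if __name__ == "__main__":
    # Hypothetical resistant-colony counts from six parallel cultures per strain,
    # and a hypothetical final CFU per culture (Nt) from drug-free plates.
    reference_counts = [1, 2, 2, 3, 5, 9]
    mutant_counts = [4, 6, 8, 12, 20, 40]
    nt = 1e8
    rate_ref = mutation_rate(lea_coulson_m(reference_counts), nt)
    rate_mut = mutation_rate(lea_coulson_m(mutant_counts), nt)
    print(f"fold increase relative to reference: {rate_mut / rate_ref:.2f}")
```

Dividing by a measured Nt for each strain is also what addresses the concern about clumping: a low total CFU raises an apparent frequency but is explicitly accounted for in the rate.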

      • Figure 4 would appear to be measuring drug tolerance rather than resistance. Are the elevated CFU in the presence of drugs in the mutY-Arg262Gln strain due to an increase in the number of drug-resistant or drug-sensitive bacteria? This could be assessed by quantifying the resulting CFU in the presence or absence of the indicated drugs.

      To ascertain whether the better survival is due to the acquisition of mutations in the direct targets of antibiotics or to drug tolerance, we performed WGS of the strains from the ex vivo evolution experiment (Figure 5). Genomic DNA extracted from ten independent colonies (grown in vitro) was mixed in equal proportions prior to library preparation. Only those SNPs present in >20% of reads were retained for the analysis. Analysis of Rv sequences grown in vitro suggested that the laboratory strain has accumulated 100 SNPs compared with the reference strain, so the sequence of the Rv laboratory strain was used as the reference for the subsequent analysis. WGS data for the RvΔmutY, RvΔmutY::mutY, and RvΔmutY::mutY-R262Q strains grown in vitro did not show any mutation in the antibiotic target genes. In a similar vein, ten independent colonies each were selected for WGS from the 7H11-OADC plates after the final round of ex vivo selection in the presence or absence of antibiotics. The data indicated that in the absence of antibiotics, no direct target mutations were identified in the ex vivo passaged strains (Figure 6a & e). In the presence of isoniazid, we found katG mutations (Ser315Thr or Ser315Ile) in Rv and RvΔmutY but not in RvΔmutY::mutY or RvΔmutY::mutY-R262Q (Figure 6b & e). These findings are congruent with the ex vivo evolution CFU analysis, in which we did not observe a significant increase in the survival of RvΔmutY and RvΔmutY::mutY-R262Q in the presence of isoniazid (Figure 5). In the presence of ciprofloxacin and rifampicin, direct target mutations were identified in gyrA and rpoB (Figure 6c-e): Asp94Glu/Asp94Gly mutations were identified in gyrA, and His445Tyr/Ser450Leu mutations in rpoB, of RvΔmutY and RvΔmutY::mutY-R262Q, respectively. No direct target mutations were identified in Rv and RvΔmutY::mutY, suggesting that perturbed DNA repair aids the acquisition of drug resistance-conferring mutations in Mtb (Figure 6c-e & Supplementary File 8).

      To determine whether the better survival of RvΔmutY or RvΔmutY::mutY-R262Q in the guinea pig infection experiment (Figure 8) is due to the accumulation of mutations in the host, we performed WGS of the strains isolated from guinea pig lungs. The analysis revealed that specific genes, such as cobQ1, smc, espI, and valS, were mutated only in RvΔmutY and RvΔmutY::mutY-R262Q but not in Rv or RvΔmutY::mutY. In addition, tcrA and gatA were mutated only in RvΔmutY, whereas rv0746 was mutated exclusively in RvΔmutY::mutY (Figure 8-figure supplement 2). However, we did not observe any direct target mutations, possibly because the guinea pigs were not subjected to antibiotic treatment. These data suggest that continued long-term selection pressure is necessary for the bacilli to acquire such mutations.

    1. Author Response

      Reviewer #1 (Public Review):

      The data support the claims, and the manuscript does not have significant weaknesses in its present form. Key strengths of the paper include using a creative HR-based reporter system combining different inducible DSB positions along a chromosome arm and testing plasmid-based and chromosomal donor sequences. Combining that system with the visualization of specific chromosomal sites via microscopy is powerful. Overall, this work will constitute a timely and helpful contribution to the field of DSB/genome mobility in DNA repair, especially in yeast, and may inform similar mechanisms in other organisms. Importantly, this study also reconciles some of the apparent contradictions in the field.

      We thank the reviewer for these positive comments on the quality of the THRIV system, in helping us to understand global mobility and to reconcile the different studies in the field. The possibility that these mobilities also exist in other organisms is attractive because they could be a way to anticipate the position of the damage in the genome and its possible outcome.

      Reviewer #2 (Public Review):

      The authors are clarifying the role of global mobility in homologous recombination (HR). Global mobility is positively correlated with recombinant product formation in some reports, whereas other studies argue the contrary and report that global mobility is not essential for HR. To characterize the role of global chromatin mobility during HR, the authors set up a system in haploid yeast cells that allows simultaneous tracking of HR at the single-cell level and the analysis of different DSB induction positions. By moving the position of the DSB within their system, the authors postulate that the chromosomal conformation surrounding a DNA break affects the global mobility response. Finally, the authors assessed the contributions of H2A(X) phosphorylation, checkpoint progression and Rad51 to the mobility response.

      One of the strengths of the manuscript is the development of "THRIV" as an efficient method for tracking homologous recombination in vivo. The authors take advantage of the power of yeast genetics and use gene deletions as well as mutations to test the contribution of H2A(X) phosphorylation, checkpoint progression and Rad51 to the mobility response in their THRIV system.

      A major weakness of the manuscript is the lack of a marker indicating that DSB formation has occurred (or is occurring). Although at 6 hours there is 80% I-SceI cutting, around 20% of the cells are uncut and cannot be distinguished from those that are cut (or have already been repaired). Thus, the MSD analysis is done blind with respect to which cells are actually undergoing DSB repair.

      The authors clearly outlined their aims and have substantial evidence to support their conclusions. They discovered new features of global mobility that may clear up some of the controversies in the field. They overinterpreted some of their observations, but these criticisms can be easily addressed.

      The authors addressed conflicting results concerning the importance of global mobility to HR, and their results aid in reconciling some of the controversies in the field. A key strength of this manuscript is the analysis of global mobility in response to breaks at different locations within chromosomes. They identified two types of DSB-induced global chromatin mobility involved in HR and postulate that they differ based on the position of the DSB. For example, DSBs close to the centromere exhibit increased global mobility that is not essential for repair and depends solely on H2A(X) phosphorylation. However, if the DSB is far away from the centromere, then global mobility is essential for HR and is dependent on H2A(X) phosphorylation, checkpoint progression as well as the Rad51 recombinase.

      The Bloom lab had previously identified differences in mobility based on the position of the tracked site. However, in the study reported here, the mobility response is analyzed after inducing DSBs located at different positions along the chromosome.

      They also addressed the question of the importance of the Rad51 protein for increased global mobility in haploid cells. Previous studies used DNA-damaging agents that induce DSBs randomly throughout the genome, where it would have been rare to induce DSBs near the centromere. In the studies reported in this manuscript, they find no increase in global mobility in a rad51∆ background for breaks induced near the centromere (proximal), but find that the response to breaks induced near the telomeres (distal) depends on both gamma-H2A(X) spreading and the Rad51 recombinase.

      We thank the referee for his constructive comments on the strength of our system to accurately determine the impact of a DSB according to its position in the genome. The issue that damaged cells cannot be distinguished from undamaged ones is very important and exciting, because it confronts our data with the question of biological heterogeneity. We provide evidence that our findings remain consistent even though undamaged cells are not detected.

      Reviewer #3 (Public Review):

      In this study, Garcia Fernandez et al. employ a variety of genetic constructs to define the mechanism underlying the global chromatin mobility elicited in response to a single DNA double-strand break (DSB). Such increases in local and global chromatin mobility were described a decade ago by the Gasser and Rothstein laboratories, and a number of determinants have been identified: one epistasis group results in H2A-S129 phosphorylation via Rad9 and Mec1 activation. The mechanism is thought to involve chromatin rigidification (Herbert 2017; Miné-Hattab 2017) or general eviction of histones (Cheblal 2020). More enigmatically, the global chromatin mobility increase also depends on Rad51, a central recombination protein downstream of checkpoint activation (Smith & Rothstein 2017), which is also required for local DSB mobility (Dion .. Gasser 2012). The authors set out to address this difficulty in the field.

      A premise of their study is the convergence of two types of observations: First, the H2A phosphorylation ChIP profile matches that of Rad51, with both spreading in trans on other chromosomes at the level of centromeres when a DSB occurs in the vicinity of one of them (Renkawitz 2014). Second, global mobility depends on H2A phosphorylation and on Rad51 (their previous study Herbert 2017). They thus address whether the Rad51-ssDNA filament (and associated proteins) marks the chromatin engaged during the homology search. They found that the extent of the mobility depends on the residency time of the filament in a particular genomic and nuclear region, which can be induced at an initially distant trans site by providing a region of homology. Unfortunately, these findings are not clearly apparent from the title and the abstract, and in fact somewhat misrepresented in the manuscript, which would call for a rewrite (see points below).

      The main goal of our study was to understand the role of global mobility in repair by homologous recombination, depending on the location of the damage. We found distinct global mobility mechanisms, in particular with respect to the involvement of the Rad51 nucleofilament, depending on whether the DSB was pericentromeric or not. It is thus likely that, when the DSB is far from the pericentromere, the residence time of the Rad51 nucleofilament with the donor has an impact on global mobility. Although our experiments were not designed to answer directly the question of the residence time of the nucleofilament, we now discuss the causes and consequences of global mobility in more detail.

      To this end, they induce the formation of a site-specific DSB in either of two regions, a centromere-proximal region or a telomere-proximal region, and measure the mobility of an undamaged site near the centromere on another chromosome (with a LacO-LacI-GFP system). This system reveals that only the centromere-proximal DSB induces mobility of the centromere-proximal undamaged site, in a Rad9- and Rad51-independent manner. Providing a homologous donor in the vicinity of the LacO array (albeit in trans) restores its mobility when the DSB is located in a subtelomeric region, in a Rad9- and Rad51-dependent fashion. These genetic requirements are the same as those described for local DSB mobility (Dion & Gasser 2012), drawing a link between the two types of mobility that, to my knowledge, had not been described. The authors should focus their message (too scattered in the current manuscript) on these key findings and on the diffusive "painting" model, in which the canvas is H2A, the moving paintbrush Mec1, and the hand the Rad51-ssDNA filament whose movement depends on Rad9. In the absence of Rad51-Rad9, the hand stays still, only decorating H2A in its immediate environment. The amount of paint deposited depends on the residency time of the Rad51-ssDNA-Mec1 filament in a given nuclear region. This synthesis is in agreement with the data presented and contrasts with their proposal that "two types of global mobility" exist.

      The brush model is very useful in explaining the distal mobility, which is indeed linked to the genetic requirements of local mobility, but it is also helpful to think of a different model when pericentromeric damage occurs. To stay with painting techniques, this model would resemble the pouring technique, in which oil paint is deposited on water and spreads in multiple directions. It is likely that Mec1 or Tel1 are the factors responsible for this spreading pattern. We therefore propose to maintain the notion of two distinct types of mobility. Without going into pictorial techniques in the text, we have attempted to clarify these two models in the manuscript.

      The rest of the manuscript attempts to define a role in DSB repair for this phospho-H2A-dependent mobility, using a fluorescence recovery assay upon DSB repair. They correlate a defect in centromere-proximal mobility (in the rad9 or h2a-s129a mutant) when a DSB is induced distantly in the subtelomere with a defect in repairing the DSB. Repair efficiency is not affected by these mutations when the donor is initially located close to the DSB site. This part is less convincing, as repair failure specifically at a distant donor in the rad9 and H2A-S129A mutants may result from chromatin-related defects other than mobility (e.g., affecting homology sampling, DNA strand invasion, D-loop extension, D-loop disruption, etc.), which could be partially alleviated by repeated DSB-donor encounters when the two are spatially close. In fact, suggesting that undamaged-site mobility is required for the early step of the homology search directly contradicts the fact that the centromere-proximal mobility induced by a subtelomeric DSB depends on the presence of a donor near the centromere: mobility is thus a product of homology identification and increased Rad51-ssDNA filament residency in the vicinity of the centromere, and so lies downstream of the homology search. This is a major pitfall in their interpretation and model.

      We thank the referee for helping to clarify the question of the cause and consequence of global mobility. As he pointed out, the fact that a donor is required to observe both H2A phosphorylation and distal mobility implicates the recombination process itself, as well as the residence time of the Rad51 nucleofilament, in γ-H2A(X) spreading, and indicates that recombination would be the cause of distal mobility. In contrast, the fact that proximal mobility can exist independently of homologous recombination suggests that, in this particular configuration, HR would then be a consequence of proximal mobility.

      In conclusion, I think the data presented are of importance, as they identify a link between local and global chromatin mobility. The authors should rewrite their manuscript and reorganize the figures to focus on the painter model that their data support. I propose experiments that will help bolster the manuscript conclusions.

      1) Attempt dual-color tracking of the DSB (e.g., Rad52-mCherry or Ddc1-mCherry) and the donor site, and track MSD as a function of proximity between the DSB and the Lac array (with DSB +/-dCen). The expectation is that the MSD at the centromere-proximal LacO array should increase with a subtelomeric DSB only upon contact (or after the two come within close range). Furthermore, this approach will help distinguish MSDs in cells bearing a DSB (Rad52 foci) from undamaged ones (no Rad52 foci) (see Mine-Hattab & Rothstein 2012). This would help overcome the inefficient DSB induction of their system (less than 50% at 1 hr post-galactose addition, reaching 80% at 6 hr). To give the reader a better appreciation of the data distribution, replace the box-and-whisker plots of MSD at 10 seconds with either scatter dot plots or violin plots, whichever conveys the distribution most clearly: indeed, a bimodal distribution is expected in the current data, with undamaged cells having lower, and damaged cells higher, MSDs.
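      For readers less familiar with the metric, the MSD at a given lag (here 10 seconds) is the average squared displacement of a tracked locus over all pairs of time points separated by that lag. A minimal sketch for a single 2D track follows; the frame interval (`FRAME_INTERVAL_S`) and the helper `msd_at_lag` are hypothetical, and real analyses may also correct for localization error and nuclear or cell drift, which this sketch omits.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]  # (x, y) position, e.g. in micrometres


def msd_at_lag(track: Sequence[Point], lag_frames: int) -> float:
    """Time-averaged mean squared displacement of one 2D track at a lag of
    `lag_frames` frames (all overlapping time-point pairs are used)."""
    if not 0 < lag_frames < len(track):
        raise ValueError("lag_frames must be between 1 and len(track) - 1")
    squared = [
        (track[i + lag_frames][0] - track[i][0]) ** 2
        + (track[i + lag_frames][1] - track[i][1]) ** 2
        for i in range(len(track) - lag_frames)
    ]
    return sum(squared) / len(squared)


# Hypothetical acquisition: one frame every 0.5 s, so a 10 s lag = 20 frames.
FRAME_INTERVAL_S = 0.5
LAG_FRAMES = round(10 / FRAME_INTERVAL_S)
```

One such value per cell is what a scatter dot plot or violin plot of "MSD at 10 seconds" would display as a single point.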

      The reviewer raises two points here.

      The first point concerns the residence time of the Rad51 filament with the donor when a subtelomeric DSB occurs. Measuring MSDs as a function of the distance between the donor and Rad52-mCherry (or Ddc1-mCherry) would allow us to decide whether global mobility is a cause or a consequence. If mobility is the consequence of (stochastic) contact, leading to more efficient homologous recombination, we would see an increase in MSDs only when the distance between donor and filament is small. Conversely, if global mobility is the cause of contact, the increase in mobility would be visible even when the distance between donor and filament is large. This would require a labelling system with three different fluorophores: one for global mobility, one for the donor and one for following the filament. Such triple labelling has yet to be developed.

      The second point concerns the important question of population heterogeneity, a central challenge in biology. Here we wish to distinguish between undamaged and damaged cells. Even if damaged cells had been selected, this would not entirely resolve the inherent cell-to-cell variation: at a given time, a damaged cell may move little and, conversely, an undamaged cell may move more. The question of heterogeneity is therefore important and the subject of intense research that goes beyond the scope of our work (Altschuler and Wu, 2010). However, to begin to clarify whether a bias could exist when considering a mixed population (20% undamaged and 80% damaged), we analyzed MSDs using a scatter plot. We considered the two cell populations in which the damage is best controlled: i) the red population, which we know has been repaired and, importantly, has lost the cut site and will not be cut again (undamaged-only population); and ii) the white population, blocked in G2/M because it is damaged and not repaired (damaged-only population). These two populations show very significant differences in their median MSDs. We artificially mixed the MSD values obtained from these two populations at a ratio of 20% undamaged-only cells to 80% damaged-only cells. The mean MSDs of the damaged-only and undamaged-only cells were significantly different, yet the mean MSD of damaged-only cells was not statistically different from the mean MSD of the 20%/80% mixed population. Thus, the conclusions based on the average MSDs of all cells remain consistent.
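      The in silico mixing control described above can be sketched as follows. The MSD values are invented placeholders (not the study's measurements), sampling with replacement is an assumption about how the 20%/80% mixture was built, and the helper `mix_populations` is hypothetical; the authors' statistical test is not reproduced. The sketch only shows why the mean of a mixture dominated (80%) by damaged cells lies between the two population means, weighted toward the damaged-only mean.

```python
import random
from statistics import mean
from typing import List, Sequence


def mix_populations(
    undamaged: Sequence[float],
    damaged: Sequence[float],
    frac_undamaged: float = 0.2,
    n: int = 100,
    seed: int = 0,
) -> List[float]:
    """Build an artificial mixed population of per-cell MSD values by
    sampling (with replacement) from two reference populations."""
    rng = random.Random(seed)
    n_undamaged = round(n * frac_undamaged)
    return (rng.choices(undamaged, k=n_undamaged)
            + rng.choices(damaged, k=n - n_undamaged))


# Placeholder MSD values at 10 s (um^2), for illustration only.
undamaged_only = [0.010, 0.012, 0.011, 0.009]
damaged_only = [0.020, 0.022, 0.019, 0.021]

mixed = mix_populations(undamaged_only, damaged_only)
# The expected mixture mean is 0.2 * mean(undamaged) + 0.8 * mean(damaged).
print(f"undamaged {mean(undamaged_only):.4f}  damaged {mean(damaged_only):.4f}  "
      f"mixed {mean(mixed):.4f}")
```

A fixed seed keeps the illustration reproducible; in practice one would repeat the draw many times and apply an appropriate test to the resulting distributions.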

      Scatter plot showing the MSD at 10 seconds of the damaged-only population (white), the repaired-only population (red), and the 20%/80% mixed population.

      2) Perform the phospho-H2A ChIP-qPCR in the C and S strains in the absence of Rad51 and Rad9, to strengthen the painter model.

      ChIP experiments in mutant backgrounds, as well as phosphorylation/dephosphorylation kinetics, would corroborate the mobility data described here but are beyond the scope of this manuscript. However, a phospho-H2A ChIP experiment was performed in a Δrad51 mutant in Renkawitz et al. 2013. In that case, γH2A propagation was restricted to the region around the DSB, corroborating both the requirement for Rad51 in distal mobility and the lack of requirement for Rad51 in proximal mobility.

      3) Their data at least partly run against previously published results, or fail to account for them. For instance, it is hard to see how their model (or the painter model) could explain the constitutively activated global mobility increase observed by Smith .. Rothstein 2018 in a rad51 rad52 mutant. Furthermore, the Gasser lab linked the increased chromatin mobility to a general loss of histones genome-wide, which would be inconsistent with the more localized mechanism proposed here. Do they represent an independent mechanism? These conflicting observations need to be discussed in detail.

      Apart from the fact that the mechanisms in place in a haploid or a diploid cell are not necessarily comparable, it is not clear to us that our data are inconsistent with those of Smith et al. (Smith et al., 2018). Indeed, it is not known by which mechanisms the increase in global mobility is constitutively activated in a Δrad51 Δrad52 mutant. However, according to their hypothesis, checkpoint induction is likely, and so is H2A phosphorylation. It would be interesting to verify γH2A levels in such a context. This question is now mentioned in the main text.

      Concerning histone loss, it appears to differ depending on the number of DSBs. Upon multiple DNA damage following genotoxic treatment with Zeocin, Susan Gasser's group has clearly established that nucleosome loss occurs (Cheblal et al., 2020; Hauer et al., 2017). Nucleosome loss, like H2A phosphorylation as we have shown (Garcia Fernandez et al., 2021; Herbert et al., 2017), leads to increased global mobility. The state of chromatin following these histone losses or modifications is not yet fully understood, but the two could coexist. In the case of a single DSB induced by HO, it is the local mobility of the MAT locus that is examined (Fig. 3B in Cheblal et al., 2020). In this case, the increase in mobility indeed depends on Arp8, which controls histone degradation, and correlates with a polymer pattern consistent with normal chromatin. It is likely that histone degradation occurs locally when a single DSB occurs. Whether histone loss also occurs genome-wide remains an open question. If histone eviction nevertheless occurred globally upon a single DSB, both types of modification could be at play. This aspect is now mentioned in the discussion.

    1. Author Response:

      Reviewer #3 (Public Review):

      INaR is related to an alternative inactivation mode of voltage-activated sodium channels. It was suggested that an intracellular charged particle blocks the sodium channel alpha subunit from the intracellular side, in addition to the canonical fast inactivation pathway. Putative particles identified were the sodium channel beta4 subunit and fibroblast growth factor 14. However, abolishing the expression of either protein does not eliminate INaR. Therefore, as recently suggested by several authors, it is conceivable that INaR is not mediated by a particle-driven mechanism at all. Instead, these and other proteins might bind to the pore-forming alpha subunit and endow it with an alternative inactivation pathway, as envisioned by the authors in this paper.

      The main experimental findings were: (1) the amplitude of INaR is independent of the voltage of the preceding step; (2) the peak amplitudes of INaR depend on the duration of the depolarizing step but are independent of the sodium driving force; and (3) INaT and INaR are differentially sensitive to recovery from inactivation. Based on their experimental data, the authors put forward a kinetic scheme that was fitted to their voltage-clamp patch-clamp recordings of freshly isolated Purkinje cells. The kinetic model proposed here has one open state and three inactivated states: two related to fast inactivation (IF1, IF2) and one related to a slower process (IS). Notably, IS and IF are not directly linked in the kinetic scheme.
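      To make the topology described above concrete, the sketch below integrates the master equation dP/dt = QP for a reduced scheme with one closed state C, one open state O, two fast-inactivated states (IF1, IF2) and one slow-inactivated state IS, with IS reachable from C and O but not directly from IF1 or IF2. The single closed state, the C<->IS and O<->IS links, the names `RATES` and `simulate`, and all rate constants are illustrative assumptions at one fixed voltage; they are not the authors' fitted model or parameters, which would use voltage-dependent rates.

```python
from typing import Dict, Tuple

STATES = ("C", "O", "IF1", "IF2", "IS")

# Illustrative rate constants (1/ms) at one fixed voltage; NOT fitted values.
RATES: Dict[Tuple[str, str], float] = {
    ("C", "O"): 2.0, ("O", "C"): 0.5,          # activation / deactivation
    ("O", "IF1"): 1.0, ("IF1", "O"): 0.05,     # fast inactivation
    ("IF1", "IF2"): 0.2, ("IF2", "IF1"): 0.02,
    ("O", "IS"): 0.1, ("IS", "O"): 0.01,       # slow inactivation from O
    ("C", "IS"): 0.05, ("IS", "C"): 0.005,     # slow inactivation from C
}


def simulate(p0: Dict[str, float], rates: Dict[Tuple[str, str], float],
             t_end_ms: float, dt_ms: float = 1e-3) -> Dict[str, float]:
    """Forward-Euler integration of the master equation dP/dt = Q P."""
    p = dict(p0)
    for _ in range(int(round(t_end_ms / dt_ms))):
        dp = {s: 0.0 for s in p}
        for (src, dst), k in rates.items():
            flux = k * p[src] * dt_ms  # probability moved src -> dst this step
            dp[src] -= flux
            dp[dst] += flux
        for s in p:
            p[s] += dp[s]
    return p


# Start with all channels closed and follow state occupancies for 10 ms.
p = simulate({s: (1.0 if s == "C" else 0.0) for s in STATES}, RATES, 10.0)
```

Forward Euler with a small step is used only for brevity; a matrix exponential or a stiff ODE solver would be the more robust choice for fitting, where rates can span several orders of magnitude.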

      In my humble opinion, the proposed kinetic model fails to explain important experimental aspects and falls short of being related to the molecular machinery of sodium channels, as outlined below. Still, it is high time to advance the concepts of INaR. The new experimental findings of the authors are important in this respect, and some ideas of the new model might be integrated into future kinetic schemes. In addition, the INaR framework is not easy to get hold of, given the many experimental findings in the literature. My review likely also falls short in some aspects. Discussion is much needed and appreciated.

      INaT & INaR decay: The authors state that the decay speeds of INaT and INaR differ and hence that different mechanisms are involved. However, at a given voltage (-45 mV) they have nicely illustrated (Fig. 2D and, in the simulation, Fig. 3H) that this is not the case. This statement is also not compatible with the Markov model used, because (at a given voltage) the decay of both currents proceeds from the same open state. Apparent inactivation time constants might differ, though, due to the transition to the on state.

      We apologize that the language used was confusing. Our suggestion that there is more than one pathway for inactivation (from an open/conducting state) is based on the observation that the decay of INaT is biexponential at steady-state voltages. In the revised manuscript, we point out (lines 546-549) that, at some voltages, the slower of the two decay time constants of INaT is identical to the time constant of INaR decay. We also discuss how this observation was previously interpreted (Raman and Bean, 2001).
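      The biexponential description of INaT decay referred to above can be written as follows, where A_f and A_s are amplitudes and τ_f and τ_s the fast and slow decay time constants; the point made in the reply is that, at some voltages, τ_s matches the decay time constant τ_R of INaR. The symbols A_R and τ_R are introduced here for illustration, and whether a constant offset term is included depends on the fitting protocol, so it is omitted as an assumption.

```latex
I_{\mathrm{NaT}}(t) = A_f\, e^{-t/\tau_f} + A_s\, e^{-t/\tau_s},
\qquad
I_{\mathrm{NaR,\,decay}}(t) = A_R\, e^{-t/\tau_R},
\qquad
\tau_s \approx \tau_R \ \text{(at some voltages)}
```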

      Accumulation in the IS state after INaT inactivation in IF1 and IF2 has to proceed through closed states. How is this compatible with current NaV models? The authors have addressed this issue in the discussion. The arguments they have brought forward are not convincing to me, since toxins and mutations grossly impair channel function.

      Thank you for this comment. We would like to point out that, in our Markov model, Nav channels may accumulate in IS through either the closed state or open state. This requires, of course, that Nav channels can recover from inactivation prior to deactivation. While we agree that toxins and mutations can grossly impair channel function, we think these studies remain crucial in revealing the potential gating mechanisms of Nav channel pore-forming subunits, and how these mechanisms may vary across cell types that express different combinations of accessory proteins.

      Fast inactivation - parallel inactivation pathways: Related to the comment above, the motivation to introduce a second fast-inactivated state IF2 is not clear. Using three inactivated states would imply three inactivation time constants (O->IF1, IF1->IF2, O->IS), which are indeed partially visible in the simulation (Fig. 3). However, experimental data on INaT inactivation seldom require more than one time constant for fast inactivation. Importantly, the authors do not provide data on INaT inactivation in the model in Fig. 3. Fast inactivation is mapped to the binding of the IFM particle. In this model, at slightly negative potentials IF1 and IF2 reverse from absorbing states to dissipating states. How is this compatible with the IFM mechanism? Additionally, the statements in the discussion are not helpful: either a second time constant is required for IF (two distinct states, with two time constants) or it is not.

      We thank this Reviewer for this comment. We tried to develop the model based on previous data on Nav channel inactivation. Indeed, much experimental data exists for the fast inactivation pathway (O -> IF1). As we noted in the discussion, without the IF2 state we were unable to fully reproduce our experimental data, which led us to add it. As with all model development, we balanced the need to faithfully reproduce the experimental data against the effort to limit the complexity of the model structure. In addition, as noted in the Methods section, our routine is an automatic parameter optimization routine that seeks to minimize the error between simulation and experiments. We can never be sure that we have found an absolute minimum, or whether the optimization got stuck at a local minimum when simulating without IF2. In other words, there may be a parameter set that sufficiently fits the data without IF2, but we were unable to find it. As a safeguard against local minima, we used multiple starts of the optimization routine with different initial parameter sets. In each case, we were unable to find an acceptable parameter set without IF2.
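      The multistart safeguard described above can be illustrated with a generic sketch: a simple derivative-free local search (compass search, chosen here only for brevity) is restarted from several random initial parameter vectors, and the best result is kept. The function names and the toy objective, which stands in for the simulation-versus-experiment error, are hypothetical; the authors' actual optimization routine is described in their Methods and is not reproduced here. As in the reply, failing to find a good fit after many restarts is evidence, not proof, that no acceptable parameter set exists.

```python
import random
from typing import Callable, List, Sequence, Tuple

Objective = Callable[[Sequence[float]], float]


def compass_search(f: Objective, x0: Sequence[float], step: float = 0.5,
                   tol: float = 1e-6, max_iter: int = 10_000
                   ) -> Tuple[List[float], float]:
    """Derivative-free local minimization: try +/- step along each axis,
    accept improvements, and halve the step when no move helps."""
    x = list(x0)
    fx = f(x)
    for _ in range(max_iter):
        improved = False
        for i in range(len(x)):
            for sign in (1.0, -1.0):
                trial = list(x)
                trial[i] += sign * step
                f_trial = f(trial)
                if f_trial < fx:
                    x, fx, improved = trial, f_trial, True
                    break
        if not improved:
            step /= 2
            if step < tol:
                break
    return x, fx


def multistart(f: Objective, bounds: Sequence[Tuple[float, float]],
               n_starts: int = 20, seed: int = 0) -> Tuple[List[float], float]:
    """Run compass_search from random starts inside `bounds`; keep the best."""
    rng = random.Random(seed)
    best: Tuple[List[float], float] = ([], float("inf"))
    for _ in range(n_starts):
        x0 = [rng.uniform(lo, hi) for lo, hi in bounds]
        x, fx = compass_search(f, x0)
        if fx < best[1]:
            best = (x, fx)
    return best


def toy_error(x: Sequence[float]) -> float:
    """Toy objective with a local minimum near +1 and a global one near -1."""
    return (x[0] ** 2 - 1) ** 2 + 0.3 * x[0]


best_x, best_err = multistart(toy_error, bounds=[(-2.0, 2.0)])
```

A single run started at x = 1.5 settles in the local minimum near +1, whereas the restarts recover the global minimum near -1, which is the failure mode multistarts are meant to guard against.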

      We agree with this Reviewer that at slightly negative potentials (compared to strong depolarizations), channels exit the IF1 state at different rates, although we would point out that channels dissipate from the IF1 state (accumulating in IS1) under both conditions (see Figure 8B-C). This requires the binding and unbinding of the IFM motif to occur with some voltage-sensitivity. We believe this to be possible in light of evidence suggesting that IFM binding (and fast inactivation) is an allosteric effect (Yan et al., 2017) and evidence showing that mutations in the pore-lining S6 segments can shift the voltage-dependence of fast inactivation without correlated shifts in the voltage-dependence of activation (Cervenka et al., 2018). However, it remains unclear how voltage-sensing in the Nav channel interacts with fast- and slow-inactivation processes.

      Due to space constraints in Figure 3, we did not show a plot of INaT voltage dependence. However, below, please find the experimental data (points), and simulated (line) INaT in our model.

      Differential recovery of INaT & INaR: The different kinetics of INaT and INaR are a very interesting finding. In my opinion, these data are not compatible with the proposed Markov model (and the authors do not provide simulation data). If INaT1 and INaT2 (Fig. 5A) have the same amplitude, the occupancy of the open state must be the same. I see no way to proceed differentially to the open state of INaR in subsequent steps unless, e.g., slow inactivated states are introduced.

      Thank you for bringing up this important point. The differential recovery of INaT and INaR indicates there are distinct Nav channel populations underlying the Nav currents in Purkinje neurons. We make this point on lines 632-635 of the revised manuscript. Because our Markov model is used to simulate a single channel population, we do not expect the model to reproduce the results shown in Figure 5. We have now added this point to the Discussion section on lines 637-640.

      Kinetic scheme: Comparison with the Raman-Bean model is somewhat unfair unless its parameters are fitted to the same dataset used in this study. However, the authors have an important point in stating that this model could not reproduce all aspects of INaR. A more detailed discussion (and perhaps analysis) of the states required for the models would be ideal, including recent literature (e.g., J Physiol. 2020 Jan;598(2):381-40). Could the Raman-Bean model perform better if an additional inactivated state were introduced? Are alternative connections possible in the proposed model? How ambiguous is the model? Given my statements above, is a second open state required? Finally, a better link between the introduced states and NaV structure-function relationships would be beneficial.

      These are all excellent points. We absolutely agree; it was/is not our intention to “prove” that the Raman-Bean model does not fit our dataset (as you mention, with proper refinement of the parameters, some of the data may be well fit). In fact, qualitatively we found the Raman-Bean model quite consistent with our dataset (which is an excellent validation of both the model and our data). It was our intention to show (in Figure 7) that there is good agreement between the Raman-Bean model and our experimental data for steady-state inactivation (C), availability (D), and recovery from inactivation (E). While we find the magnitude of the resurgent current (F) to be markedly different from the Raman-Bean data, we now note that this is likely due to the large differences in the extracellular Na+ concentrations used in voltage-clamp experiments (lines 440-444). Our models, however, differ specifically in our parallel fast and slow inactivation pathways (Figure 7H). In the Raman-Bean model, in response to a prolonged depolarizing holding potential, there is negligible inactivation, as the OB state remains absorbing until the channel is repolarized. This is primarily because the channel must transit through the open state on repolarization. We find distinctly different behavior in our data. As seen in the experimental data shown in Figure 7H, despite a prolonged depolarization, Nav channels begin to inactivate and accumulate in the slow inactivated state without prerequisite channel opening. This behavior is impossible to fit with the Raman-Bean model, given its topological constraint of a single pathway from the OB state through the open state.

      To that point, it is also unlikely that the addition of inactivated states to the Raman-Bean model would help fit this new dataset. Indeed, the Raman-Bean model contains 7 inactivated states. If there were a connection between OB ->I6, it is possible that direct inactivation (bypassing the O state) may help. Again, however, it is not our intention to discredit the Raman-Bean model, nor is it our intention to improve the Raman-Bean model. With new datasets, a fresh look at model topology was undertaken, which is how we developed our proposed model.

      This Reviewer astutely points out a known limitation of Markov (state-chain) modeling: it is impossible to establish the uniqueness (or ambiguity) of the model, in terms of both parameters and model topology. Following the results of Menon et al. 2009 (PNAS vol. 106 / #39 / 16829 – 16834), who used a state-mutating genetic algorithm to vary the topology of a Markov model, our group (Mangold et al. 2021, PLoS Comp Bio) recently published an algorithm that enumerates all distinct model structures using rooted graph theory (e.g., all possible models rooted around a single open state). We found (not entirely surprisingly) that many model structures and parameter sets adequately fit certain datasets (e.g., cardiac Nav channels).
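      The point that many topologies are possible even for small schemes can be made concrete with a brute-force count. The sketch below is not the rooted-graph enumeration algorithm of Mangold et al. 2021; it simply lists every set of undirected transitions among a few labelled states in which all states are reachable from the open state O, ignoring rate parameters and transition directionality. The state names and helper functions are hypothetical. Even with four states this yields dozens of candidate structures, and the count grows very quickly as states are added.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Edge = Tuple[str, str]


def all_reachable(states: Sequence[str], edges: Sequence[Edge]) -> bool:
    """True if every state can be reached from states[0] (the open state)."""
    neighbours = {s: set() for s in states}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen = {states[0]}
    stack = [states[0]]
    while stack:
        for nxt in neighbours[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return len(seen) == len(states)


def enumerate_topologies(states: Sequence[str]) -> List[Tuple[Edge, ...]]:
    """All connected sets of (undirected) transitions among `states`."""
    possible = list(combinations(states, 2))
    topologies = []
    # A connected scheme on n states needs at least n - 1 transitions.
    for k in range(len(states) - 1, len(possible) + 1):
        for edges in combinations(possible, k):
            if all_reachable(states, edges):
                topologies.append(edges)
    return topologies


# Hypothetical 4-state scheme: one open state and three inactivated states.
schemes = enumerate_topologies(["O", "I1", "I2", "I3"])
print(len(schemes))  # prints 38
```

Each of these structures would still need its own parameter fit, which is why a brute-force approach quickly becomes impractical and more systematic enumeration methods are needed.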

      Therefore, the goal is never to find the model (indeed we don’t propose that we have done so), but rather to find a model with acceptable fits to the data and then use that model to hypothesize why that model structure works, as well as to hypothesize higher dimensional dynamics. We make these points in the revised manuscript (lines 591-597).

      We did not specifically explore the impact of a second open state in our modeling and simulation studies, but we would certainly agree that a model with a second open state may recapitulate the dataset.

    1. Author Response

      Reviewer #3 (Public Review):

      In this ms Li et al. examine the molecular interaction of Rabphilin 3A with the SNARE complex protein SNAP25 and its potential impact in SNARE complex assembly and dense core vesicle fusion.

      Overall, the literature on rabphilin as a major rab3/27 effector in synaptic function has been quite enigmatic. After its cloning and initial biochemical analysis, rather little new has been found about rabphilin, in particular since loss-of-function analysis has revealed few synaptic phenotypes (Schluter 1999, Deak 2006), arguing against a crucial role for rabphilin in synaptic function.

      While the interaction of rabphilin with SNAP25 via the bottom part of its C2 domain has already been described biochemically and structurally in Deak et al. 2006 and elsewhere, the authors make significant efforts to further map the interactions between SNAP25 and rabphilin, and indeed identify additional binding motifs in the first 10 amino acids of SNAP25 that appear critical for the rabphilin interaction.

      Using SNAP25 knockdown-rescue experiments, TIRF-based imaging of labeled dense core vesicles showed that the N-terminus of SN25 is absolutely essential for SV membrane proximity and release. Similar, somewhat weaker phenotypes were observed when binding-deficient rabphilin mutants were overexpressed in PC12 cells coexpressing WT rabphilin. The loss-of-function phenotypes of the SN25 and rabphilin interaction mutants led the authors to claim that rabphilin-SN25 interactions are critical for docking and exocytosis. The role of these interaction sites was subsequently tested in SNARE assembly assays, which largely supported rabphilin accelerating SNARE assembly in an SN25 N-terminus-dependent way.

      Regarding the impact of this work, the transition of synaptic vesicles to forming fusion-competent trans-SNARE complexes is critical to our understanding of regulated vesicle exocytosis, and the authors put forward an attractive model in which rabphilin aids in catalyzing SNARE complex assembly by controlling the α-helicity of the SNAP25 SNARE motif. This would provide a regulatory mechanism similar to those put forward for the other two SNARE proteins via their interactions with Munc18 and intersection, respectively.

      We thank Reviewer #3 for the summary of the paper and for the praise of our work. Our point-by-point replies are as follows:

      While the discovery of the novel interaction site of rabphilin with the N-terminus of SNAP25 is interesting, I have issues with the functional experiments. The key question for the paper is whether it provides convincing data on the functional role of the interactions, given the history of loss-of-function phenotypes for rabphilin. First, the authors use PC12 cells and dense core vesicle docking and fusion assays. Primary neurons, in which rabphilin function has been tested before, have unfortunately not been used, reducing the impact of the docking and fusion phenotypes.

      We have addressed these questions in our response to Essential Revision 3 and added a corresponding passage to the Discussion section (pp. 18-19, lines 407-427).

      In particular, the loss-of-function phenotype of the N-terminally deleted SNAP25 in docking and fusion (Figure 3) is profound, at a level similar to the complete loss of the SNARE protein itself. This is of concern because it stands in stark contrast to mammalian neurons, where the phenotype of SNAP25 loss is very severe while rabphilin loss has almost no effect on secretion. This would argue that the N-terminus of SNAP25 has other critical functions besides interacting with rabphilin. In addition, the N-terminal SNAP25 deletion mutant may be made in the cell (as indicated by the western blot) but may not be properly trafficked to the site of release.

      To test whether the N-peptide deletion mutant of SN25 properly targets to the plasma membrane, we overexpressed SN25 FL or SN25 (11–206) with a C-terminal EGFP tag in PC12 cells and monitored the localization of SN25 FL-EGFP and SN25 (11–206)-EGFP near the plasma membrane by TIRF microscopy. The average fluorescence intensity of SN25 (11–206)-EGFP did not differ significantly from that of SN25 FL-EGFP (see figure below), suggesting that deletion of the N-peptide does not noticeably affect the trafficking of SN25 to the plasma membrane.

      (A) TIRF imaging assay to monitor the localization of SN25-EGFP near the plasma membrane. Overexpression of SN25 FL-EGFP (left) and SN25 (11–206)-EGFP (right) using pEGFP-N3 vector in PC12 cells. Scale bars, 10 μm. (B) Quantification of the average fluorescence intensity of SN25-EGFP near the plasma membrane in (A). Data are presented as mean ± SEM (n ≥ 10 cells in each). Statistical significance and P values were determined by Student’s t-test. ns, not significant.

    1. Author Response

      Reviewer #1 (Public Review):

      The authors present a PyTorch-based simulator for prosthetic vision. The model takes as input the anatomical location of a visual cortical prosthesis as well as a series of electrical stimuli to be applied to each electrode, and outputs the resulting phosphenes. To demonstrate the usefulness of the simulator, the paper reproduces psychometric curves from the literature and uses the simulator in the loop to learn optimized stimuli.

      One of the major strengths of the paper is its modeling work: the authors make good use of existing knowledge about retinotopic maps and psychometric curves that describe phosphene appearance in response to single-electrode stimulation. Using PyTorch as a backbone is another strength, as it allows for GPU acceleration and seamless integration with common deep learning models. This work is likely to be impactful for the field of sight restoration.

      1) However, one of the major weaknesses of the paper is its model validation - while some results seem to be presented for data the model was fit on (as opposed to held-out test data), other results lack quantitative metrics and a comparison to a baseline ("null hypothesis") model. On the one hand, it appears that the data presented in Figs. 3-5 was used to fit some of the open parameters of the model, as mentioned in Subsection G of the Methods. Hence it is misleading to present these as model "predictions", which are typically presented for held-out test data to demonstrate a model's ability to generalize. Instead, this is more of a descriptive model than a predictive one, and its ability to generalize to new patients remains yet to be demonstrated.

      We agree that the original presentation of the model fits might give rise to unwanted confusion. In the revision, we have adapted the fit of the thresholding mechanism to include a 3-fold cross-validation, in which part of the data was excluded during fitting and used as a test set to calculate the model's performance. The results of the cross-validation are now presented in panel D of Figure 3. Fitting the brightness and temporal dynamics parameters with cross-validation was not feasible due to the limited amount of quantitative data describing temporal dynamics and phosphene size and brightness for intracortical electrodes. To avoid confusion, we have adapted the corresponding text and figure captions to make clear that these fits describe the data rather than predict held-out data.
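      For readers less familiar with the procedure, a k-fold cross-validation of a detection-threshold fit can be sketched as follows. This is a minimal sketch under stated assumptions, not the simulator's code: it assumes a logistic psychometric function of charge per phase, fits it by a coarse grid search on the training folds, and scores the mean held-out log-likelihood per trial. The function names, grids, and the synthetic trials (arbitrary charge units) are hypothetical.

```python
import numpy as np


def psychometric(charge, q50, slope):
    """Logistic probability of perceiving a phosphene at a given charge per phase (assumed form)."""
    return 1.0 / (1.0 + np.exp(-(charge - q50) / slope))


def neg_log_lik(charge, seen, q50, slope):
    """Bernoulli negative log-likelihood of binary detection responses."""
    p = np.clip(psychometric(charge, q50, slope), 1e-9, 1 - 1e-9)
    return -np.sum(seen * np.log(p) + (1 - seen) * np.log(1 - p))


def fit_grid(charge, seen, q50_grid, slope_grid):
    """Return the (q50, slope) pair on the grid with the lowest training loss."""
    _, q50, slope = min(
        (neg_log_lik(charge, seen, q, s), q, s) for q in q50_grid for s in slope_grid
    )
    return q50, slope


def cross_validate(charge, seen, k=3, seed=0):
    """k-fold CV: fit on k-1 folds, report mean held-out log-likelihood per trial."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(charge)), k)
    q50_grid = np.linspace(charge.min(), charge.max(), 101)
    slope_grid = np.linspace(0.5, 20.0, 40)
    scores = []
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        q50, slope = fit_grid(charge[train], seen[train], q50_grid, slope_grid)
        scores.append(-neg_log_lik(charge[test], seen[test], q50, slope) / len(test))
    return scores


# Synthetic trials for illustration only (not clinical data): threshold 25, slope 4.
rng = np.random.default_rng(1)
charge = rng.uniform(0.0, 60.0, size=600)
seen = (rng.random(600) < psychometric(charge, 25.0, 4.0)).astype(float)
scores = cross_validate(charge, seen)
print(scores)
```

      Held-out scores above log(0.5) per trial indicate that the fitted curve predicts unseen responses better than chance; the key design point is that each fold's parameters never see that fold's trials.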

      We note that the goal of the simulator is not to provide a single set of parameters that precisely describes phosphene perception for all patients; rather, the simulator can also be used to capture variability among patients. Indeed, the model can be tailored to new patients based on a small dataset. Figure 3-figure supplement 1 exemplifies how our simulator can be tailored to several datasets collected from patients with surface electrodes. Future clinical experiments could be used to verify how well the simulator can be tailored to data from other patients.

      Specifically, we have made the following changes to the manuscript:

      • Caption Figure 2: the fitted peak brightness levels reproduced by our model

      • Caption Figure 3: The model's probability of phosphene perception is visualized as a function of charge per phase

      • Caption Figure 3: Predicted probabilities in panel (d) are the results of a 3-fold cross-validation on held-out test data.

      • Line 250: we included biologically inspired methods to model the perceptual effects of different stimulation parameters

      • Line 271: Each frame, the simulator maps electrical stimulation parameters (stimulation current, pulse width and frequency) to an estimated phosphene perception

      • Lines 335-336: such that 95% of the Gaussian falls within the fitted phosphene size.

      • Lines 469-470: Figure 4 displays the simulator's fit on the temporal dynamics found in a previously published study by Schmidt et al. (1996).

      • Lines 922-925: Notably, the trade-off between model complexity and accurate psychophysical fits or predictions is a recurrent theme in the validation of the components implemented in our simulator.
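      The Gaussian width rule quoted for lines 335-336 can be made concrete. The sketch below assumes that phosphene size is a diameter and that each phosphene is an isotropic 2D Gaussian, for which the fraction of volume inside radius R is 1 - exp(-R²/(2σ²)); setting this to 0.95 gives σ = R / sqrt(-2 ln 0.05) ≈ R / 2.45. If the manuscript instead means a 1D profile, the analogous rule would be R = 1.96σ. The function names are hypothetical.

```python
import math


def gaussian_sigma_from_size(phosphene_diameter_deg, mass=0.95):
    """Sigma of an isotropic 2D Gaussian with `mass` of its volume inside the phosphene disk."""
    radius = phosphene_diameter_deg / 2.0
    # Invert P(r <= R) = 1 - exp(-R^2 / (2 sigma^2)) for sigma.
    return radius / math.sqrt(-2.0 * math.log(1.0 - mass))


def mass_inside(radius, sigma):
    """Fraction of an isotropic 2D Gaussian's volume within `radius` of its center."""
    return 1.0 - math.exp(-radius ** 2 / (2.0 * sigma ** 2))


sigma = gaussian_sigma_from_size(2.0)  # a 2-degree phosphene
print(round(sigma, 4), round(mass_inside(1.0, sigma), 4))
```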

      2) On the other hand, the results presented in Fig. 8 as part of the end-to-end learning process are not accompanied by any sorts of quantitative metrics or comparison to a baseline model.

      We now realize that the presentation of the end-to-end results might have given the impression that we present novel image processing strategies. However, the development of a novel image processing strategy is outside the scope of the study. Instead, the study aims to provide an improved simulation that can be used for more realistic assessment of different stimulation protocols. The simulator needs to fit experimental data, and it should run fast (so it can be used in behavioral experiments). Importantly, as demonstrated in our end-to-end experiments, the model can be used in differentiable programming pipelines (so it can be used in computational optimization experiments), which is a valuable contribution in itself because it lends itself to many machine learning approaches that can improve the realism of the simulation.

      We have rephrased our study aims in the discussion to improve clarity.

      • Lines 275-279: In the sections below, we discuss the different components of the simulator model, followed by a description of some showcase experiments that assess the ability to fit recent clinical data and the practical usability of our simulator in simulation experiments

      • Lines 810-814: Computational optimization approaches can also aid in the development of safe stimulation protocols, because they allow a faster exploration of the large parameter space and enable task-driven optimization of image processing strategies (Granley et al., 2022; Fauvel et al., 2022; White et al., 2019; Küçükoglü et al. 2022; de Ruyter van Steveninck et al., 2022; Ghaffari et al., 2021).

      • Lines 814-819: Ultimately, the development of task-relevant scene-processing algorithms will likely benefit both from computational optimization experiments as well as exploratory SPV studies with human observers. With the presented simulator we aim to contribute a flexible toolkit for such experiments.

      • Lines 842-853: Eventually, the functional quality of the artificial vision will not only depend on the correspondence between the visual environment and the phosphene encoding, but also on the implant recipient's ability to extract that information into a usable percept. The functional quality of end-to-end generated phosphene encodings in daily life tasks will need to be evaluated in future experiments. Regardless of the implementation, it will always be important to include human observers (both sighted experimental subjects and actual prosthetic implant users) in the optimization cycle to ensure subjective interpretability for the end user (Fauvel et al., 2022; Beyeler & Sanchez-Garcia, 2022).

      3) The results seem to assume that all phosphenes are small Gaussian blobs, and that these phosphenes combine linearly when multiple electrodes are stimulated. Both assumptions are frequently challenged by the field. For all these reasons, it is challenging to assess the potential and practical utility of this approach as well as get a sense of its limitations.

      The reviewer raises a valid point, and a similar point was raised by a different reviewer (our response is duplicated). As pointed out in the discussion, many aspects of multi-electrode phosphene perception remain unclear. On the one hand, the literature agrees that there is some degree of predictability: some papers explicitly state that phosphenes produced by multiple patterns are generally additive (Dobelle & Mladejovsky, 1974), that their locations are predictable (Bosking et al., 2018), and that multi-electrode stimulation can be used to generate complex, interpretable patterns of phosphenes (Chen et al., 2020, Fernandez et al., 2021). On the other hand, in some cases the stimulation of multiple electrodes is reported to lead to brighter phosphenes (Fernandez et al., 2021), fused or displaced phosphenes (Schmidt et al., 1996, Bak et al., 1990), or unpredicted phosphene patterns (Fernández et al., 2021). It is likely that the probability of these interference patterns decreases as the distance between the stimulated electrodes increases. An empirical finding is that the critical distance for intracortical stimulation is approximately 1 mm (Ghose & Maunsell, 2012).

      We note that our simulator is not restricted to the simulation of linearly combined Gaussian blobs. Some irregularities, such as elongated phosphene shapes, were already supported in the previous version of our software. Furthermore, we added a supplementary figure that displays a possible approach to simulating some of the more complex electrode interactions reported in the literature, with only minor adaptations to the code. Our study thereby aims to present a flexible simulation toolkit that can be adapted to the needs of the user.

      Adjustments:

      • Added Figure 1-figure supplement 3 on irregular phosphene percepts.

      • Lines 957-970: Furthermore, in contrast to the assumptions of our model, interactions between simultaneous stimulation of multiple electrodes can have an effect on the phosphene size and sometimes lead to unexpected percepts (Fernandez et al., 2021, Dobelle & Mladejovsky 1974, Bak et al., 1990). Although our software supports basic exploratory experimentation of non-linear interactions (see Figure 1-figure supplement 3), by default, our simulator assumes independence between electrodes. Multi-phosphene percepts are modeled using linear summation of the independent percepts. These assumptions seem to hold for intracortical electrodes separated by more than 1 mm (Ghose & Maunsell, 2012), but may underestimate the complexities observed when electrodes are nearer. Further clinical and theoretical modeling work could help to improve our understanding of these non-linear dynamics.
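      The default independence assumption described above amounts to summing one Gaussian blob per active electrode. The sketch below illustrates this under stated assumptions: isotropic Gaussians on a square visual-field grid (the 16-degree default echoes the field of view mentioned in the revised text), no saturation, and a helper that flags electrode pairs closer than the roughly 1 mm critical distance, where the linear model is least trustworthy. It is not the simulator's API; all names are hypothetical.

```python
import numpy as np


def render_phosphenes(centers_deg, sigmas_deg, brightness, fov_deg=16.0, res=128):
    """Linear model: sum independent isotropic Gaussian phosphenes on a visual-field grid."""
    axis = np.linspace(-fov_deg / 2.0, fov_deg / 2.0, res)
    x, y = np.meshgrid(axis, axis)
    image = np.zeros((res, res))
    for (cx, cy), sigma, b in zip(centers_deg, sigmas_deg, brightness):
        image += b * np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
    return image  # no saturation applied, so the output is strictly additive


def close_pairs(cortex_mm, min_dist_mm=1.0):
    """Index pairs of electrodes closer than min_dist_mm, where independence is doubtful."""
    pts = np.asarray(cortex_mm, dtype=float)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    rows, cols = np.nonzero(np.triu(dist < min_dist_mm, k=1))
    return list(zip(rows.tolist(), cols.tolist()))


# Two phosphenes rendered together equal the sum of each rendered alone.
pair = render_phosphenes([(1.0, 0.0), (-2.0, 1.0)], [0.5, 0.8], [1.0, 0.6])
print(pair.shape, close_pairs([[0.0, 0.0], [0.5, 0.0], [3.0, 0.0]]))
```

      Non-linear interactions, such as those explored in Figure 1-figure supplement 3, would replace the plain summation in this loop; flagging close pairs is one simple way to decide where such a correction might be needed.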

      4) Another weakness of the paper is the term "biologically plausible", which appears throughout the manuscript but is not clearly defined. In its current form, it is not clear what makes this simulator "biologically plausible" - it certainly contains a retinotopic map and is fit on psychophysical data, but it does not seem to contain any other "biological" detail.

      We thank the reviewer for the remark. We improved our description of what makes the simulator “biologically plausible” in the introduction (line 78): ‘‘Biological plausibility, in our work's context, points to the simulation's ability to capture essential biological features of the visual system in a manner consistent with empirical findings: our simulator integrates quantitative findings and models from the literature on cortical stimulation in V1 [...]”. In addition, we mention in the discussion (lines 611 - 621): “The aim of this study is to present a biologically plausible phosphene simulator, which takes realistic ranges of stimulation parameters, and generates a phenomenologically accurate representation of phosphene vision using differentiable functions. In order to achieve this, we have modeled and incorporated an extensive body of work regarding the psychophysics of phosphene perception. From the results presented in section H, we observe that our simulator is able to produce phosphene percepts that match the descriptions of phosphene vision that were gathered in basic and clinical visual neuroprosthetics studies over the past decades.”

      5) In fact, for the most part the paper seems to ignore the fact that implanting a prosthesis in one cerebral hemisphere will produce phosphenes that are restricted to one half of the visual field. Yet Figures 6 and 8 present phosphenes that seemingly appear in both hemifields. I do not find this very "biologically plausible".

      We agree with the reviewer that contemporary experiments with implantable electrodes usually test electrodes in a single hemisphere. However, future clinically useful approaches should use bilaterally implanted electrode arrays. Our simulator can present phosphene locations in either one or both hemifields.

      We have made the following textual changes:

      • Fig. 1 caption: Example renderings after initializing the simulator with four 10 × 10 electrode arrays (indicated with roman numerals) placed in the right hemisphere (electrode spacing: 4 mm, in correspondence with the commonly used 'Utah array' (Maynard et al., 1997)).

      • Lines 518-525: The simulator is initialized with 1000 possible phosphenes in both hemifields, covering a field of view of 16 degrees of visual angle. Note that the simulated electrode density and placement differs from current prototype implants and the simulation can be considered to be an ambitious scenario from a surgical point of view, given the folding of the visual cortex and the part of the retinotopic map in V1 that is buried in the calcarine sulcus.

      • Lines 546-547: with the same phosphene coverage as the previously described experiment
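      How a single-hemisphere implant stays confined to one hemifield can be sketched with a simple visuotopic map. The example below assumes a monopole (log-polar) model, w = k·log(z + a), mapping visual-field position z (degrees, complex) to cortical position w (mm, complex), and mirrors it so that the right hemisphere maps to the left hemifield and vice versa. This is an illustrative assumption, not necessarily the visuotopic model used in the simulator, and the constants k and a are illustrative values.

```python
import numpy as np

K_MM = 17.3   # illustrative cortical scaling constant (mm), an assumption
A_DEG = 0.75  # illustrative foveal offset (deg), an assumption


def _mirror(z, hemisphere):
    """Left hemisphere -> right hemifield (x >= 0); right hemisphere -> left hemifield."""
    if hemisphere == "right":
        return -np.conj(z)  # flip horizontally: x -> -x, keep y
    if hemisphere == "left":
        return z
    raise ValueError("hemisphere must be 'left' or 'right'")


def cortex_to_visual_field(w_mm, hemisphere, k=K_MM, a=A_DEG):
    """Inverse monopole map: cortical position (mm) -> phosphene location (deg)."""
    z = np.exp(np.asarray(w_mm, dtype=complex) / k) - a
    return _mirror(z, hemisphere)


def visual_field_to_cortex(z_deg, hemisphere, k=K_MM, a=A_DEG):
    """Forward monopole map: phosphene location (deg) -> cortical position (mm)."""
    z = _mirror(np.asarray(z_deg, dtype=complex), hemisphere)
    return k * np.log(z + a)


# The same cortical site yields mirror-image phosphenes depending on the hemisphere.
print(cortex_to_visual_field(10 + 5j, "left"), cortex_to_visual_field(10 + 5j, "right"))
```

      Restricting the electrode list to one hemisphere then automatically restricts phosphenes to the contralateral hemifield, while listing electrodes in both hemispheres reproduces the bilateral scenario.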

      Reviewer #2 (Public Review):

      Van der Grinten and De Ruyter van Steveninck et al. present a design for simulating cortical-visual-prosthesis phosphenes that emphasizes features important for optimizing the use of such prostheses. The characteristics of simulated individual phosphenes were shown to agree well with data published from the use of cortical visual prostheses in humans. By ensuring that functions used to generate the simulations were differentiable, the authors permitted and demonstrated integration of the simulations into deep-learning algorithms. In concept, such algorithms could thereby identify parameters for translating images or videos into stimulation sequences that would be most effective for artificial vision. There are, however, limitations to the simulation that will limit its applicability to current prostheses.

      The verification of how phosphenes are simulated for individual electrodes is very compelling. Visual-prosthesis simulations often do ignore the physiologic foundation underlying the generation of phosphenes. The authors' simulation takes into account how stimulation parameters contribute to phosphene appearance and show how that relationship can fit data from actual implanted volunteers. This provides an excellent foundation for determining optimal stimulation parameters with reasonable confidence in how parameter selections will affect individual-electrode phosphenes.

      We thank the reviewer for these supportive comments.

      Issues with the applicability and reliability of the simulation are detailed below:

      1) The utility of this simulation design, as described, unfortunately breaks down beyond the scope of individual electrodes. To model the simultaneous activation of multiple electrodes, the authors' design linearly adds individual-electrode phosphenes together. This produces relatively clean collections of dots that one could think of as pixels in a crude digital display. Modeling phosphenes in such a way assumes that each electrode and the network it activates operate independently of other electrodes and their neuronal targets. Unfortunately, as the authors acknowledge and as noted in the studies they used to fit and verify individual-electrode phosphene characteristics, simultaneous stimulation of multiple electrodes often obscures features of individual-electrode phosphenes and can produce unexpected phosphene patterns. This simulation does not reflect these nonlinearities in how electrode activations combine. Nonlinearities in electrode combinations can be as subtle as the phosphenes becoming brighter while still remaining distinct, or as problematic as generating only a single small phosphene that is indistinguishable from the activation of a subset of the activated electrodes, or from that of a single electrode.

      If a visual prosthesis happens to generate some phosphenes that can be elicited independently, a simulator of this type could perhaps be used by processing stimulation from independent groups of electrodes and adding their phosphenes together in the visual field.

      The reviewer raises a valid point, and a similar point was raised by a different reviewer (our response is duplicated). As pointed out in the discussion, many aspects of multi-electrode phosphene perception remain unclear. On the one hand, the literature agrees that there is some degree of predictability: some papers explicitly state that phosphenes produced by multiple patterns are generally additive (Dobelle & Mladejovsky, 1974), that their locations are predictable (Bosking et al., 2018), and that multi-electrode stimulation can be used to generate complex, interpretable patterns of phosphenes (Chen et al., 2020, Fernandez et al., 2021). On the other hand, in some cases the stimulation of multiple electrodes is reported to lead to brighter phosphenes (Fernandez et al., 2021), fused or displaced phosphenes (Schmidt et al., 1996, Bak et al., 1990), or unpredicted phosphene patterns (Fernández et al., 2021). It is likely that the probability of these interference patterns decreases as the distance between the stimulated electrodes increases. An empirical finding is that the critical distance for intracortical stimulation is approximately 1 mm (Ghose & Maunsell, 2012).

      We note that our simulator is not restricted to the simulation of linearly combined Gaussian blobs. Some irregularities, such as elongated phosphene shapes, were already supported in the previous version of our software. Furthermore, we added a supplementary figure that displays a possible approach to simulating some of the more complex electrode interactions reported in the literature, with only minor adaptations to the code. Our study thereby aims to present a flexible simulation toolkit that can be adapted to the needs of the user.

      Adjustments:

      • Lines 957-970: Furthermore, in contrast to the assumptions of our model, interactions between simultaneous stimulation of multiple electrodes can have an effect on the phosphene size and sometimes lead to unexpected percepts (Fernandez et al., 2021, Dobelle & Mladejovsky 1974, Bak et al., 1990). Although our software supports basic exploratory experimentation of non-linear interactions (see Figure 1-figure supplement 3), by default, our simulator assumes independence between electrodes. Multi-phosphene percepts are modeled using linear summation of the independent percepts. These assumptions seem to hold for intracortical electrodes separated by more than 1 mm (Ghose & Maunsell, 2012), but may underestimate the complexities observed when electrodes are nearer. Further clinical and theoretical modeling work could help to improve our understanding of these non-linear dynamics.

      • Added Figure 1-figure supplement 3 on irregular phosphene percepts.

      2) Verification of how the simulation renders individual phosphenes based on stimulation parameters is an important step in confirming agreement between the simulation and the function of implanted devices. That verification was well demonstrated. The end use of a visual-prosthesis simulation, however, would likely not be optimizing just the appearance of phosphenes, but predicting and optimizing functional performance in visual tasks. Investigating whether this simulator can predict visual-task performance, either with sighted volunteers or a decoder model, that is similar to published task performance of visual-prosthesis implantees would be a necessary step for true validation.

      We agree with the reviewer that it will be vital to investigate the utility of the simulator in tasks. However, the literature on the performance of users of a cortical prosthesis in visually-guided tasks is scarce, making it difficult to compare task performance between simulated versus real prosthetic vision.

      Secondly, the main objective of the current study is to propose a simulator that emulates the sensory / perceptual experience, i.e. the low-level perceptual correspondence. Once more behavioral data from prosthetic users become available, studies can use the simulator to make these comparisons.

      Regarding the comparison to simulated prosthetic vision in sighted volunteers, there are some fundamental limitations. For instance, sighted subjects are exposed for a shorter duration to the (simulated) artificial percept and lack the experience and training that prosthesis users get. Furthermore, sighted subjects may be unfamiliar with compensation strategies that blind individuals have developed. It will therefore be important to conduct clinical experiments.

      To convey more clearly that our experiments are performed to verify the practical usability in future behavioral experiments, we have incorporated the following textual adjustments:

      • Lines 275-279: In the sections below, we discuss the different components of the simulator model, followed by a description of some showcase experiments that assess the ability to fit recent clinical data and the practical usability of our simulator in simulation experiments.

      • Lines 842-853: Eventually, the functional quality of the artificial vision will not only depend on the correspondence between the visual environment and the phosphene encoding, but also on the implant recipient's ability to extract that information into a usable percept. The functional quality of end-to-end generated phosphene encodings in daily life tasks will need to be evaluated in future experiments. Regardless of the implementation, it will always be important to include human observers (both sighted experimental subjects and actual prosthetic implant users) in the optimization cycle to ensure subjective interpretability for the end user (Fauvel et al., 2022; Beyeler & Sanchez-Garcia, 2022).

      3) A feature of this simulation is being able to convert stimulation of V1 to phosphenes in the visual field. If used, this feature would likely only be able to simulate a subset of phosphenes generated by a prosthesis. Much of V1 is buried within the calcarine sulcus, and electrode placement within the calcarine sulcus is not currently feasible. As a result, stimulation of visual cortex typically involves combinations of the limited portions of V1 that lie outside the sulcus and higher visual areas, such as V2.

      We agree that some areas (most notably the calcarine sulcus) are difficult to access in a surgical implantation procedure. A realistic simulation of state-of-the-art cortical stimulation should only partially cover the visual field with phosphenes. However, it may be predicted that some of these challenges will be addressed by new technologies. We chose to make the simulator as generally applicable as possible, and users of the simulator can decide which phosphene locations are simulated. To demonstrate that our simulator can be flexibly initialized to simulate specific implantation locations using third-party software, we have now added a supplementary figure (Figure 1-figure supplement 1) that demonstrates an electrode grid placement on a 3D brain model, generating the phosphene locations from receptive field maps. At the same time, the simulator is general and can also be used to guide future strategies that aim to, e.g., cover the entire field with electrodes or compare performance between upper and lower hemifields.

      Reviewer #3 (Public Review):

      The authors are presenting a new simulation for artificial vision that incorporates many recent advances in our understanding of the neural response to electrical stimulation, specifically within the field of visual prosthetics. The authors succeed in integrating multiple results from other researchers on aspects of V1 response to electrical stimulation to create a system that more accurately models V1 activation in a visual prosthesis than other simulators. The authors then attempt to demonstrate the value of such a system by adding a decoding stage and using machine-learning techniques to optimize the system to various configurations.

      1) While there is merit to being able to apply various constraints (such as maximum current levels) and have the system attempt to find a solution that maximizes recoverable information, the interpretability of such encodings to a hypothetical recipient of such a system is not addressed. The authors demonstrate that they are able to recapitulate various standard encodings through this automated mechanism, but the advantages to using it as opposed to mechanisms that directly detect and encode, e.g., edges, are insufficiently justified.

      We thank the reviewer for this constructive remark. Our simulator is designed for more realistic assessment of different stimulation protocols in behavioral experiments or in computational optimization experiments. The presented end-to-end experiments are a demonstration of the practical usability of our simulator in computational experiments, building on a previously existing line of research. In fact, our simulator is compatible with any arbitrary encoding strategy.

      As our paper is focused on the development of a novel tool for this existing line of research, we do not aim to make claims about the functional quality of end-to-end encoders compared to alternative encoding methods (such as edge detection). That said, we agree with the reviewer that it is useful to discuss the potential benefits of end-to-end optimization compared to, e.g., edge detection.

      We have incorporated several textual changes to give a more nuanced overview and to acknowledge that many benefits remain to be tested. Furthermore, we have restated our study aims more clearly in the discussion to clarify the distinction between the goals of the current paper and the various encoding strategies that remain to be tested.

      • Lines 275-279: In the sections below, we discuss the different components of the simulator model, followed by a description of some showcase experiments that assess the ability to fit recent clinical data and the practical usability of our simulator in simulation experiments

      • Lines 810-814: Computational optimization approaches can also aid in the development of safe stimulation protocols, because they allow a faster exploration of the large parameter space and enable task-driven optimization of image processing strategies (Granley et al., 2022; Fauvel et al., 2022; White et al., 2019; Küçükoglü et al. 2022; de Ruyter van Steveninck, Güçlü et al., 2022; Ghaffari et al., 2021).

      • Lines 842-853: Eventually, the functional quality of the artificial vision will not only depend on the correspondence between the visual environment and the phosphene encoding, but also on the implant recipient's ability to extract that information into a usable percept. The functional quality of end-to-end generated phosphene encodings in daily life tasks will need to be evaluated in future experiments. Regardless of the implementation, it will always be important to include human observers (both sighted experimental subjects and actual prosthetic implant users) in the optimization cycle to ensure subjective interpretability for the end user (Fauvel et al., 2022; Beyeler & Sanchez-Garcia, 2022).

      2) The authors make a few mistakes in their interpretation of biological mechanisms, and the introduction lacks an appropriately deep review of the existing literature, giving the reader the mistaken impression that this simulator is the only attempt ever made at biologically plausible simulation, rather than merely the most recent refinement building on decades of work across the field.

      We thank the reviewer for this insight. We have improved the coverage of the previous literature to give credit where credit is due, and to address the long history of simulated phosphene vision.

      Textual changes:

      • Lines 64-70: Although the aforementioned SPV literature has provided us with major fundamental insights, the perceptual realism of electrically generated phosphenes and some aspects of the biological plausibility of the simulations can be further improved by integrating existing knowledge of phosphene vision and its underlying physiology.

      • Lines 164-190: The aforementioned studies used varying degrees of simplification of phosphene vision in their simulations. For instance, many included equally-sized phosphenes that were uniformly distributed over the visual field (informally referred to as the ‘scoreboard model’). Furthermore, most studies assumed either full control over phosphene brightness or used binary levels of brightness (e.g. 'on' / 'off'), but did not provide a description of the associated electrical stimulation parameters. Several studies have explicitly made steps towards more realistic phosphene simulations, by taking into account cortical magnification or using visuotopic maps (Fehervari et al., 2010; Li et al., 2013; Srivastava et al., 2009; Paraskevoudi et al., 2021), simulating noise and electrode dropout (Dagnelie et al., 2007), or using varying levels of brightness (Vergnieux et al., 2017; Sanchez-Garcia et al., 2022; Parikh et al., 2013). However, no phosphene simulations have modeled temporal dynamics or provided a description of the parameters used for electrical stimulation. Some recent studies developed descriptive models of the phosphene size or brightness as a function of the stimulation parameters (Winawer et al., 2016; Bosking et al., 2017). Another very recent study has developed a deep-learning based model for predicting a realistic phosphene percept for single stimulating electrodes (Granley et al., 2022). These studies have made important contributions to improve our understanding of the effects of different stimulation parameters. The present work builds on these previous insights to provide a full simulation model that can be used for the functional evaluation of cortical visual prosthetic systems.

      • Lines 137-140: Due to the cortical magnification (the foveal information is represented by a relatively large surface area in the visual cortex as a result of variation of retinal RF size), the size of the phosphene increases with its eccentricity (Winawer & Parvizi, 2016; Bosking et al., 2017).

      • Lines 883-893: Even after loss of vision, the brain integrates eye movements for the localization of visual stimuli (Reuschel et al., 2012), and in cortical prostheses the position of the artificially induced percept will shift along with eye movements (Brindley & Lewin, 1968; Schmidt et al., 1996). Therefore, in prostheses with a head-mounted camera, misalignment between the camera orientation and the pupillary axes can induce localization problems (Caspi et al., 2018; Paraskevoudi & Pezaris, 2019; Sabbah et al., 2014; Schmidt et al., 1996). Previous SPV studies have demonstrated that eye-tracking can be implemented to simulate the gaze-coupled perception of phosphenes (Cha et al., 1992; Sommerhalder et al., 2004; Dagnelie et al., 2006; McIntosh et al., 2013; Paraskevoudi & Pezaris, 2021; Rassia & Pezaris, 2018; Titchener et al., 2018; Srivastava et al., 2009).
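      As a rough illustration of the cortical-magnification argument in the Lines 137-140 change, the Python sketch below assumes the inverse-linear magnification function M(E) = k / (E + a), with k = 17.3 mm and a = 0.75 deg (Horton & Hoyt, 1991), and a fixed, hypothetical extent of activated cortex per electrode. It is not the simulator's phosphene-size model, which additionally depends on the stimulation parameters; it only shows why a constant cortical footprint maps onto larger phosphenes at larger eccentricities.

```python
def cortical_magnification(ecc_deg, k=17.3, a=0.75):
    """Linear cortical magnification in mm of V1 per degree of visual angle.

    Uses the inverse-linear approximation M(E) = k / (E + a); the default
    k and a (Horton & Hoyt, 1991) are an assumption of this sketch.
    """
    return k / (ecc_deg + a)


def phosphene_size_deg(ecc_deg, cortical_spread_mm=1.0):
    """Phosphene diameter (deg) for a fixed diameter of activated cortex (mm).

    cortical_spread_mm is a hypothetical constant here; in a full simulator
    it would depend on the electrical stimulation parameters.
    """
    return cortical_spread_mm / cortical_magnification(ecc_deg)


if __name__ == "__main__":
    # The same 1 mm cortical footprint yields larger phosphenes further out.
    for ecc in (1.0, 5.0, 20.0):
        size = phosphene_size_deg(ecc)
        print(f"eccentricity {ecc:5.1f} deg -> phosphene size {size:.3f} deg")
```

      Because M(E) falls roughly as 1/E away from the fovea, phosphene size grows roughly linearly with eccentricity for a fixed cortical spread.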

      3) The authors have importantly not included gaze position compensation, which adds more complexity than the authors suggest it would and also means that the simulator lacks a basic, fundamental feature that strongly limits its utility.

      We agree with the reviewer that the inclusion of gaze position to simulate gaze-centered phosphene locations is an important requirement for a realistic simulation. We have made several textual adjustments to section M1 to improve the clarity of the explanation and we have added several references to address the simulation literature that took eye movements into account.

      In addition, we included a link to some demonstration videos in which we illustrate that the simulator can be used for gaze-centered phosphene simulation. The simulation models the phosphene locations based on the gaze direction, and updates the input with changes in the gaze direction. The stimulation pattern is chosen to encode the visual environment at the location where the gaze is directed. Gaze-contingent processing has been implemented in prior simulation studies (for instance: Paraskevoudi et al., 2021; Rassia et al., 2018; Titchener et al., 2018) and even in the clinical setting with users of the Argus II implant (Caspi et al., 2018). From a modeling perspective, it is relatively straightforward to simulate gaze-centered phosphene locations and gaze-contingent image processing (our code will be made publicly available). At the same time, however, seen from a clinical and hardware engineering perspective, the implementation of eye-tracking in a prosthetic system for blind individuals might come with additional complexities. This is now acknowledged explicitly in the manuscript.
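      The gaze-centered simulation described in this reply can be sketched in two steps: phosphene locations are fixed in gaze-centered (retinotopic) coordinates and shifted by the current gaze direction, and, for gaze-contingent encoding, the camera region around the gaze point is selected before a stimulation pattern is computed. The Python sketch below works under simplifying assumptions (2-D grayscale frames, gaze already converted to pixel coordinates as (row, column), zero padding at the borders). The function names are hypothetical and are not the API of the dynaphos repository.

```python
import numpy as np


def phosphene_screen_positions(retinotopic_xy, gaze_xy):
    """Shift gaze-centred phosphene coordinates (deg) into head-centred coordinates.

    retinotopic_xy: (N, 2) phosphene locations relative to the fovea.
    gaze_xy: (2,) current gaze direction in the same head-centred frame.
    """
    return np.asarray(retinotopic_xy, dtype=float) + np.asarray(gaze_xy, dtype=float)


def gaze_contingent_crop(frame, gaze_px, size):
    """Return a size x size window of a 2-D frame centred on the gaze pixel.

    Regions that fall outside the frame are zero-padded, so the output
    shape does not depend on where the gaze is directed.
    """
    half = size // 2
    padded = np.pad(frame, half, mode="constant")  # pads both image axes
    r = int(gaze_px[0]) + half  # gaze row in padded coordinates
    c = int(gaze_px[1]) + half  # gaze column in padded coordinates
    return padded[r - half : r - half + size, c - half : c - half + size]


# Usage: a toy 10 x 10 "camera frame" and two phosphenes.
frame = np.arange(100, dtype=float).reshape(10, 10)
window = gaze_contingent_crop(frame, gaze_px=(5, 5), size=4)
positions = phosphene_screen_positions([[1.0, 0.0], [0.0, -2.0]], gaze_xy=(5.0, 3.0))
```

      In a head-centered simulation only the first step would apply; the second step is what makes the encoded image content follow the gaze, as in the gaze-contingent conditions of the cited studies.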

      Textual adjustment:

      Lines 883-910: Even after loss of vision, the brain integrates eye movements for the localization of visual stimuli (Reuschel et al., 2012), and in cortical prostheses the position of the artificially induced percept will shift along with eye movements (Brindley & Lewin, 1968; Schmidt et al., 1996). Therefore, in prostheses with a head-mounted camera, misalignment between the camera orientation and the pupillary axes can induce localization problems (Caspi et al., 2018; Paraskevoudi & Pezaris, 2019; Sabbah et al., 2014; Schmidt et al., 1996). Previous SPV studies have demonstrated that eye-tracking can be implemented to simulate the gaze-coupled perception of phosphenes (Cha et al., 1992; Sommerhalder et al., 2004; Dagnelie et al., 2006; McIntosh et al., 2013; Paraskevoudi et al., 2021; Rassia et al., 2018; Titchener et al., 2018; Srivastava et al., 2009). Note that some of the cited studies implemented a simulation condition where not only the simulated phosphene locations, but also the stimulation protocol depended on the gaze direction. More specifically, instead of representing the head-centered camera input, the stimulation pattern was chosen to encode the external environment at the location where the gaze was directed. While further research is required, there is some preliminary evidence that such gaze-contingent image processing can improve the functional and subjective quality of prosthetic vision (Caspi et al., 2018; Paraskevoudi et al., 2021; Rassia et al., 2018; Titchener et al., 2018). Some example videos of gaze-contingent simulated prosthetic vision can be retrieved from our repository (https://github.com/neuralcodinglab/dynaphos/blob/main/examples/). Note that an eye-tracker will be required to produce gaze-contingent image processing in visual prostheses, and there might be unforeseen complexities in the clinical implementation thereof. The study of oculomotor behavior in blind individuals (with or without a visual prosthesis) is still an ongoing line of research (Caspi et al., 2018; Kwon et al., 2013; Sabbah et al., 2014; Hafed et al., 2016).

      4) Finally, the computational capacity required to run the described system is substantial and is not one that would plausibly be used as part of an actual device, suggesting that there may be difficulties with converting results from this simulator to an implantable system.

      The software runs in real time with affordable, consumer-grade hardware. In Author response image 1 we present the results of performance testing with a 2016 model MSI GeForce GTX 1080 (priced around €600).

      Author response image 1.

      Note that the GPU is used only for the computation and rendering of the phosphene representations from given electrode stimulation patterns, which will never be part of any prosthetic device. The choice of encoder to generate the stimulation patterns will determine the required processing capacity that needs to be included in the prosthetic system, which is unrelated to the simulator’s requirements.

      The following addition was made to the text:

      • Lines 488-492: Notably, even on a consumer-grade GPU (e.g. a 2016 model GeForce GTX 1080) the simulator still reaches real-time processing speeds (>100 fps) for simulations with 1000 phosphenes at 256x256 resolution.

      5) With all of that said, the results do represent an advance, and one that could have wider impact if the authors were to reduce the computational requirements, and add gaze correction.

      We appreciate the kind compliment from the reviewer and sincerely hope that our revised manuscript meets their expectations. Their feedback has been critical to reshape and improve this work.

    1. Author Response

      Reviewer #3 (Public Review):

      In this manuscript, the authors studied the erythropoiesis and hematopoietic stem/progenitor cell (HSPC) phenotypes in a ribosome gene Rps12 mutant mouse model. They found that RpS12 is required for both steady and stress hematopoiesis. Mechanistically, RpS12+/- HSCs/MPPs exhibited increased cycling, loss of quiescence, and increased protein translation and apoptosis rates, which may be attributed to ERK and Akt/mTOR hyperactivation. Overall, this is a new mouse model that sheds light on our understanding of Rps gene function in murine hematopoiesis. The phenotypic and functional analysis of the mice are largely properly controlled, robust, and analyzed.

      A major weakness of this work is its descriptive nature, without a clear mechanism that explains the phenotypes observed in RpS12+/- mice. It is possible that the counterintuitive activation of the ERK/mTOR pathway and the increased protein synthesis rate are a compensatory negative feedback. The direct mechanism of Rps12 loss could be studied by acute deletion of Rps12, which is feasible using their floxed mice. At a minimum, this can be done in mammalian hematopoietic cell lines.

      We thank the reviewer for pointing this out. We have addressed this question by developing a new inducible conditional knockout Rps12 mouse model (see response below to major point 1).

      Below are some specific concerns that need to be addressed.

      1) Line 226. The authors conclude that "Together, these results suggest that RpS12 plays an essential role in HSC function, including self-renewal and differentiation." The reviewer has three concerns regarding this conclusion and the corresponding Figure 3. 1) The data show that RpS12+/- mice have decreased numbers of both total BM cells and multiple subpopulations of HSPCs. The frequency of HSPC subpopulations should also be shown to clarify whether the decreased HSPC numbers arise from decreased total BM cellularity or from a proportional decrease in frequency. 2) This figure characterizes phenotypic HSPCs in BM by flow and lineage cells in PB by CBC. HSC function and differentiation are not really examined in this figure, except for the colony assay in Figure 3K. The BMT data in Figure 4 actually address HSC function and differentiation, so the conclusion here should be rephrased. 3) Since all LT- and ST-HSCs, as well as all MPPs, are decreased in number, how can the authors conclude that Rps12 is important for HSC differentiation? No experiments presented here were specifically designed to address HSC differentiation.

      We thank the reviewer for this excellent point. We think that the main defect is in HSC and progenitor maintenance, rather than in HSC differentiation. This is consistent with the decrease in multiple HSC and progenitor populations, as observed both by calculating absolute numbers and by frequency of the parent population (see new Supplementary Figures S2C-S2C). We have removed any references to altered differentiation from the text.

      We added data on population frequencies to Supplementary Figure 2 and to the corresponding text (see lines 221-235).

      2) Figure 3A and 5E. The flow cytometry gating of HSC/MPP is not well performed or presented, especially the HSC plot. Populations are not well separated by phenotypic markers. This raises concerns about the validity of the quantification data.

      We chose a more representative HSC plot and included it in Figure 3A.

      3) It is very difficult to read the bone marrow cytospin images in Fig 6F without annotation of the cell types shown in the figure. It appears that WT and +/- cells look remarkably different in terms of cell size and cell types. This mouse may have other profound phenotypes that need detailed examination, such as lineage cells in the BM and spleen, and colony assays for different types of progenitors, etc.

      The purpose of the bone marrow cytospin images in Figure 6F was to show the high number of apoptotic cells in the bone marrow of Rps12 KO/+ mice compared with controls. The differences in apoptosis in the LSK and myeloid progenitor populations are quantified in the flow cytometry data shown in Figure 6G-H. A detailed quantitative analysis of different bone marrow cell populations and their relative frequencies is also shown in Figures 2 and 3. In Rps12 KO/+ bone marrow, we observed a significant decrease in multiple stem cell and progenitor populations.

      4) For all the intracellular phospho-flow shown in Fig 7, both a negative control (fluorescent secondary antibody only) and a positive stimulus should be included. It is very concerning that the histograms show no significant changes in pAKT and pERK signaling (MFI) after SCF stimulation in WT LSKs. There are no distinct peaks that indicate non-phospho-proteins and phosphoproteins. This casts doubt on the validity of the results. It is possible, though, that Rps12+/- cells have a very high basal level of pAKT/mTOR and pERK pathway activation. This again may point to a negative feedback mechanism of Rps12 haploinsufficiency.

      It is true that we did not observe an increase in pAKT, p4EBP1, or pERK in control cells in every case. This is often an issue with these specific phospho-flow cytometry antibodies, as they are not very sensitive, and the response to SCF is very time-dependent. We did observe an increase in pS6 with SCF in both LSK cells and progenitors (Figure 7B, E). However, the main point of this experiment was to assess the basal level of signaling in Rps12 KO/+ vs control cells. We did not observe hypersensitivity of RpS12 KO/+ cells to SCF, but we did observe significant increases in pAKT, pS6, p4EBP1, and pERK in Rps12 KO/+ LSK cells.

      To address the concern about the validity of staining, please see the requested flow histograms for unstained vs individual phospho-antibodies (Ab): p4EBP1, pERK, pS6 and pAKT (Figure R1 for reviewers) below. Additionally, since staining with the surface antibodies can potentially change the peak, we are including an additional control of the cell surface antibodies vs the full sample with surface antibodies and phospho-Ab: p4EBP1, pERK, pS6 and pAKT. We can include this figure in the Supplementary Data if requested.

      5) The authors performed an in vitro OP-Puro assay to assess global protein translation in different HSPC subpopulations. 1) Can the authors provide more information about the incubation media, and whether any cytokines or serum were included? The incubation media with supplements may boost the overall translation status, although cells from WT and RpS12+/- mice are cultured side by side. Based on this, an in vivo OP-Puro assay should be performed in both genotypes. 2) A polysome profiling assay should be performed in primary HSPCs, or at least in hematopoietic cell lines. It is plausible that RpS12 haploinsufficiency may affect the content of translational polysome fractions.

      We are including these details in the methods section (lines 555-565): for the in vitro OP-Puro assay, cells were resuspended in DMEM (Corning 10-013-CV) media supplemented with 50 µM β-mercaptoethanol (Sigma) and 20 µM OPP (Thermo Scientific C10456). Cells were incubated for 45 minutes at 37°C and then washed with Ca2+ and Mg2+ free PBS. No additional cytokines were added.

      We did not perform polysome profiles. Polysome profiling of mutant stem and progenitor cells would be very challenging, as their numbers are much reduced. We now deem this of reduced interest, given the conclusion of the revised manuscript that RpS12 haploinsufficiency reduces overall translation. Also, because in the RpS12-floxed/+;SCL-CRE-ERT mouse model with acute deletion of RpS12 we observed the expected decrease in translation in HSCs using the same ex vivo OPP protocol, we did not follow up with in vivo OPP treatment.

    1. Author Response:

      Reviewer #1 (Public Review):

      "Modality-specific tracking of attention and sensory statistics in the human electrophysiological spectral exponent," Waschke et al. This paper follows upon a recent paper by a subset of the same authors that laid out the signal processing bases for decomposing the EEG signal into periodic (i.e., "oscillatory") and aperiodic components (Donoghue et al., 2020). Here, the focus is on establishing physiological and functional interpretations of one of these aperiodic components: the exponent term of the 1/f(to the x power) fit to the power spectrum (a.k.a., its 'slope'). This is very important work that will have strong and lasting impact on how people design and interpret the results from EEG experiments, and is also likely to trigger many reanalyses of previously published data sets. However, the manuscript could do a better job of explaining WHY this is so. In this reviewer's opinion, more linkage with elements of Donoghue et al. (2020) would help considerably.

      First, a brief summary of what this manuscript does, and why it is important. The first section reanalyzes data sets in human subjects undergoing ketamine or propofol anaesthesia, known to influence the E:I balance in the neural circuits that give rise to the EEG. This is an important step in establishing the physiological validity of the fundamental proposition that flattening of the 1/f component reflects an increase in the E:I balance whereas steepening reflects a decrease. This is because the effects of these two anaesthetic agents have been well established in several invasive studies. The second section demonstrates the functional properties of 1/f slope, in that it tracks shifts of attention between visual and auditory stimuli in an electrode-specific manner (i.e., posterior for visual, central for auditory), and it also captures aperiodic structure in these stimuli. It's not too strong to say that, after this paper, EEG-related research will never be the same again. The reason for this, however, isn't stated as clearly as it could be.

      Thank you for your positive appraisal of our work! We appreciate that you see significant benefit to this work, and also understand that you see significant room for improvement in the way results are presented, framed and discussed, and we want to express our thanks for these helpful comments. Below, we elaborate on them and the changes they prompted in greater detail.

      With regard to exposition, the manuscript could be improved in terms of building on Donoghue et al. (2020). To simplify, a main take-away from Donoghue et al. (2020) is that many past interpretations of EEG signals have mistakenly attributed task- (or state-) related changes to changes in one or more oscillatory components of the signal. Perhaps most egregiously, what can appear as a change in power in the alpha band can often be shown to be better explained as no change in alpha but instead a change in either the slope or the offset of the 1/f component of the power spectrum. (E.g., the bump at 10 Hz will increase or decrease if the slope of the 1/f component changes, even though the 'true' oscillator centered at 10 Hz hasn't changed.) In this paper, the authors demonstrate that several conditions, including physiological state and cognitive challenge, influence 1/f slope in ways that are systematic and that occur independently of changes that may or may not be occurring simultaneously in oscillatory alpha. Broadly, the authors should consider two modifications: first, point out for each key experimental finding how attributing everything to changes in oscillatory alpha (or sometimes other frequencies) would lead to flawed inference; second, don't stop at demonstrating that the slope effects hold when alpha dynamics are partialed out, but also report the converse: in what ways is oscillatory alpha sensitive to aspects of physiology and/or behavior that 1/f slope is not? Even if there aren't any such cases (which seems unlikely), it would be informative for this to be tested and reported.

      We agree that a stronger focus on the differentiation between oscillatory and 1/f aspects of EEG activity can help to improve the didactic strength of our manuscript. Wherever possible, we have tried to make clear that the separation of different oscillatory activity and aperiodic signals is essential to not confuse one for the other. This is not only the case for the analysis of anaesthesia data, where changes in alpha and beta power have to be separated from changes in spectral exponent, but also applies to the proposed attention contrast, where common effects of alpha power have to be taken into account and differentiated from spectral exponents. Similarly, an alignment of stimulus spectra with EEG activity could appear as a twofold power change (e.g., increase over low, decrease over high frequencies) if no separation of oscillatory and aperiodic signal parts is performed.
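      The confound raised by the reviewer (an apparent change in alpha band power that is really a change in the aperiodic exponent) can be reproduced with a synthetic spectrum. The Python sketch below assumes the additive-in-log-space form log10 P(f) = b - χ·log10(f) + G(f), with aperiodic offset b, exponent χ and a Gaussian alpha peak G; all parameter values are illustrative. Keeping the peak fixed while steepening the exponent from 1 to 2 lowers conventional 8-12 Hz band power, whereas removing the aperiodic component leaves the periodic part unchanged. The exponent estimate is a simple log-log line fit outside 6-14 Hz, a deliberately simplified stand-in for the full spectral parameterization approach (Donoghue et al., 2020), not that algorithm itself.

```python
import numpy as np


def synthetic_log_spectrum(freqs, offset, exponent, peak_cf=10.0, peak_amp=0.5, peak_sd=1.0):
    """log10 power = aperiodic part (offset - exponent*log10 f) + Gaussian alpha peak."""
    aperiodic = offset - exponent * np.log10(freqs)
    peak = peak_amp * np.exp(-((freqs - peak_cf) ** 2) / (2 * peak_sd ** 2))
    return aperiodic + peak


def band_power(freqs, log_power, lo=8.0, hi=12.0):
    """Mean linear power in [lo, hi] Hz, i.e. a conventional band-power measure."""
    mask = (freqs >= lo) & (freqs <= hi)
    return np.mean(10 ** log_power[mask])


def fit_exponent(freqs, log_power, exclude=(6.0, 14.0)):
    """Estimate the aperiodic exponent from a log-log line fit outside the peak range."""
    keep = (freqs < exclude[0]) | (freqs > exclude[1])
    slope, _intercept = np.polyfit(np.log10(freqs[keep]), log_power[keep], 1)
    return -slope


freqs = np.arange(2.0, 40.5, 0.5)
# Identical alpha peak, different aperiodic exponents.
flat = synthetic_log_spectrum(freqs, offset=1.0, exponent=1.0)
steep = synthetic_log_spectrum(freqs, offset=1.0, exponent=2.0)

print("8-12 Hz band power:", band_power(freqs, flat), band_power(freqs, steep))
print("fitted exponents:  ", fit_exponent(freqs, flat), fit_exponent(freqs, steep))
```

      A band-power analysis would report a large alpha decrease for the steeper spectrum, even though the oscillatory component is identical in both cases; only separating the aperiodic fit from the peak reveals this.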

      We agree that explicitly contrasting spectral exponents with estimates of low-frequency or alpha power is essential. The original version of the manuscript already included such a comparison for the effect of attention on EEG spectral exponents and alpha power, respectively. To expand this approach, we inverted models and used stimulus spectral exponents (auditory or visual) as dependent variables while using either EEG spectral exponents, low-frequency power or alpha power as predictors (alongside the same covariates as in the winning models of the original approach). In a next step, we used likelihood ratio tests to compare model fit separately at each electrode, resulting in a topography of model comparisons.

      (a) Attention contrasts

      As expected, based on decades of EEG research, and as can be seen in figure 3C, average EEG alpha power changed as a function of attentional focus, in a topographically specific manner. Importantly, the observed increase of alpha power from auditory to visual attention took place over and above the reported changes in EEG spectral exponents (as we had reported in the control analyses section). In other words, both EEG spectral exponents and EEG alpha power capture attention-related changes in brain dynamics, but are at least partially sensitive to distinct sources or mechanisms. In the updated version of the manuscript, we emphasize that changes in spectral exponents often can be mistaken for changes in alpha power (as in Donoghue et al., 2020), calling for a dedicated spectral parameterization approach. Attention-related changes in spectral exponents and alpha power might result from distinct modes of thalamic activity, which transitions from tonic to bursty firing and shapes cortical activity to selectively process attended sensory input. In the updated version of the manuscript, we discuss the potential role of thalamic activity in greater detail. The updated parts of the discussion section are pasted below for convenience.

      “Despite these differences in the sensitivity of EEG signals, our results provide clear evidence for a modality-specific flattening of EEG spectra through the selective allocation of attentional resources. This attention allocation likely surfaces as subtle changes in E:I balance (Borgers et al., 2005; Harris and Thiele, 2011). Importantly, these results cannot be explained by observed attention-dependent differences in neural alpha power (8–12 Hz, Fig 3) which have been suggested to capture cortical inhibition or idling states (Cooper et al., 2003; Pfurtscheller et al., 1996). Also note that the employed spectral parameterization approach enabled us to separate 1/f-like signals from oscillatory activity and hence offered distinct estimates of spectral exponent and alpha power that would otherwise have been conflated (Donoghue et al., 2020).

      How could attentional goals come to shape spectral exponents and alpha oscillations? Both attention-related changes in EEG activity might trace back to distinct functions of thalamo-cortical circuits. On the one hand, bursts of thalamic activity that project towards sensory cortical areas might sculpt cortical excitability in an attention-dependent manner by inhibiting irrelevant distracting information (Klimesch et al., 2007; Saalmann and Kastner, 2011). On the other hand, tonic thalamic activity likely drives cortical desynchronization via glutamatergic projections and, with attentional focus, results in boosted representations of stimulus information within brain signals (Cohen and Maunsell, 2011; Harris and Thiele, 2011; Sherman, 2001).

      Our findings of separate attentional modulations of both EEG spectral exponents and alpha power point towards the involvement of both thalamic modes in the realization of attentional states. Recently, momentary trade-offs between both modes of thalamic activity have been suggested to give way to attention-related modulations of alpha power and E:I balance, as captured by EEG spectral exponents (Kosciessa et al., 2021). Here, task difficulty remained constant throughout the experiment, and fluctuations between both modes might not follow momentary demand (Kosciessa et al., 2021; Pettine et al., 2021) but varying sensory-cognitive resources.

      Additionally, modulations of both alpha power and EEG spectral exponents appeared uncorrelated across individuals - further evidence that they reflect separate neural sources. Future studies that combine a systemic manipulation of E:I (e.g., through GABAergic agonists) with the investigation of attentional load in humans are needed to specify with greater detail how thalamic activity modes drive alpha oscillations and EEG spectral exponents. Specifying potential demand- and resource-dependent trade-offs between different modes of attention-related modulations of cortical activity and sensory processing will offer crucial insights into the neural basis of adaptive behaviour.”

      (b) Stimulus spectral exponent tracking

      We inverted all models and, instead of modelling EEG spectral exponents, we used auditory or visual stimulus exponents as dependent variables. Predictors were identical to the previously reported models (see supplementary table for all details) but additionally included either single-trial estimates of alpha power, low-frequency power, or EEG spectral exponents. Note that alpha power estimates were extracted using the same spectral parameterization approach that was used to estimate spectral exponents. Trials without an oscillation in the alpha range were excluded from all models to render likelihood comparisons interpretable (11.2% ± 3.4%). Since oscillations were only rarely detected in the low-frequency range (1–5 Hz), we instead used single-trial power averaged across this range. For each electrode, 4 likelihood ratio tests were performed, one for each combination of stimulus modality and predictor (low-frequency or alpha power). Strikingly, low-frequency power resulted in worse model fits (non-positive likelihood ratio test statistics) compared to EEG spectral exponents across all electrodes and both stimulus modalities. The same was true for EEG alpha power when modelling auditory stimulus exponents. However, when modelling visual stimulus exponents, EEG alpha power displayed significantly improved model fit at one parietal electrode. In line with this observation, we observed a positive relationship between single-trial alpha power and visual stimulus exponents at this parietal site (see below).
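      The per-electrode model comparison described above can be sketched as follows. This sketch assumes ordinary least-squares models with Gaussian errors and synthetic data; the models in this response use the covariates listed in the supplementary table and may differ in structure. Both models share an intercept and the same covariates and differ only in the main predictor, so the statistic 2·(LL_candidate - LL_reference) is positive when the candidate predictor (e.g. alpha power) fits better than the EEG spectral exponent. Because the two models are not nested, the usual chi-square reference distribution does not strictly apply, so the sketch reports only the statistic.

```python
import numpy as np


def gaussian_ols_loglik(X, y):
    """Maximised log-likelihood of an OLS fit with Gaussian errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n = y.size
    sigma2 = resid @ resid / n  # maximum-likelihood estimate of the error variance
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)


def compare_predictors(y, candidate, reference, covariates):
    """Return 2*(LL_candidate - LL_reference) for two same-size models.

    Both design matrices contain an intercept and the shared covariates and
    differ only in the main predictor; positive values favour `candidate`.
    """
    ones = np.ones((y.size, 1))
    X_cand = np.column_stack([ones, covariates, candidate])
    X_ref = np.column_stack([ones, covariates, reference])
    return 2 * (gaussian_ols_loglik(X_cand, y) - gaussian_ols_loglik(X_ref, y))


# Synthetic single-trial data for one electrode (illustrative only).
rng = np.random.default_rng(0)
n = 500
eeg_exponent = rng.normal(size=n)
alpha_power = rng.normal(size=n)
covariates = rng.normal(size=(n, 2))
stim_exponent = 0.8 * eeg_exponent + 0.3 * covariates[:, 0] + rng.normal(scale=0.5, size=n)

stat = compare_predictors(stim_exponent, alpha_power, eeg_exponent, covariates)
print(f"likelihood ratio statistic (alpha vs EEG exponent): {stat:.1f}")
```

      Repeating this comparison for every electrode, stimulus modality and candidate predictor yields one topography per model family, analogous to those in Figure R5.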

      Figure R5 Model comparison topographies. (a) Single-trial auditory (upper row) or visual stimulus exponents (lower row) were modelled based on electrode-wise low-frequency power (left column) or alpha power (right column), among other covariates. Models were compared to a model of the same size that differed only in the main predictor, which consisted of single-trial EEG spectral exponents. Topographies display the likelihood ratio test statistic, showing no improvement in model fit compared to EEG spectral exponent-based models in all but one model family and highlighting the unique predictive power of aperiodic EEG activity in this context. Alpha power at one parietal electrode explained significantly more variance in visual stimulus exponents. (b) T values representing the main effect of alpha power on visual stimulus exponents. The highlighted electrode represents p < .05 after FDR correction.

      (c) Behavioural relevance of spectral exponent tracking

      Given the results from (b), we refrained from re-running PLS analysis focussing on the behavioural relevance of the links between low-frequency and alpha power with stimulus exponents. In our view, the absence of a significant link between single trial stimulus input and a measure of neural activity in this case precludes any further analysis on the between-subject level.

      Reviewer #2 (Public Review):

      The paper investigates two separate studies looking at the spectral exponent of the EEG 1/f-like spectrum: one a study of the effect of anesthesia type (propofol vs. ketamine), using publicly available data, and the other a traditional study of auditory and visual processing relying on selective attention to one modality vs. the other. The authors make a strong case that the value of the spectral exponent depends on the relevant condition, in both studies, but the case for the spectral exponent's dependence on the Excitation:Inhibition balance is much weaker.

      The paper presents the two separate studies as tightly linked, but by the end of the paper it appears they may be quite separate.

      The anesthesia study is brief and compelling. With respect to the effect of anesthesia type on spectral exponent, the results are very strong, and, given the results of Gao et al. (2017) and the stated properties of propofol vs. ketamine, the connection to E:I balance follows naturally.

      The auditory and spectral 1/f tracking study suffers from some weaknesses.

      Most importantly, the design is elegant and the results presented are very compelling. 1) Modality-specific attention selectively reduces the EEG spectral exponent (for relevant electrodes reflecting cortical processing of that modality); 2) Changing the value of the spectral exponent in the stimulus results in a similar change in the value of the spectral exponent of the response, but only for the selectively attended modality (and only for relevant electrodes); and 3) the amount of modality-specific spectral-exponent tracking predicts behavior. The interactions and main effects found all support the importance of the spectral exponent as a physiologically and behaviorally important index.

      The main problem is a weakness in analysis regarding whether the mechanistic origin of the above effects may be due to temporal tracking of the stimulus waveform (visual contrast/acoustic envelope) by the response waveform. [In the speech literature this would be referred to as "speech tracking", or, sometimes, as speech entrainment (in the weak sense of "entrainment").] As pointed out by the authors, this is not a steady state response because the instantaneous fluctuation rate of the stimulus is constantly changing, and so cannot be analyzed as such (it is also distinct from the evoked responses analyzed). But it is a good match for other analysis methods, for instance Ed Lalor's VESPA and AESPA methods, and their reverse-correlation descendants. Specifically, Lalor et al., 2009 analyzed EEG responses to a non-sinusoidal envelope modulation of a broadband noise carrier and found strong evidence for robust temporal locking. The success of such linear methods there (AESPA for auditory; VESPA for visual) implies that a change in the stimulus spectrum exponent would produce a similar change in the response spectrum exponent, having nothing to do with E:I balance.

      The evoked response analysis clearly aims to go in this direction, but since it does not reflect ongoing response properties, it cannot alone speak to this.

      Because this plausible mechanism for the spectral-exponent tracking has not been explored, it is much harder to attribute the observed spectral-exponent tracking to E:I balance. As a result, the study does not hold together well with the anesthesia study, and it weakens the link to E:I balance rather than strengthening it.

      Thank you for this in-depth assessment of our work and your generally positive appraisal of it. Importantly, your major point of concern seems to trace back, at least partially, to a regrettable misunderstanding caused by the way we presented our results in the original version of the manuscript. While the first study aimed at establishing the validity of the EEG spectral exponent as a non-invasive marker of E:I, the second study had two objectives. First, to test attention-related changes in EEG spectral exponents that we assume to reflect topographically specific changes in E:I. Second, to test the link between aperiodic stimulus features and aperiodic EEG activity by comparing stimulus spectral exponents and EEG spectral exponents. We understand that the reviewer is doubtful of the link between stimulus-related EEG spectral exponent changes and E:I, and so are we.

      In the updated version of the manuscript, we have tried to make it very clear that despite the displayed and inferred links between EEG spectral exponents and E:I balance, the positive relationship between stimulus spectral exponents and EEG spectral exponents does not necessarily reflect changes in E:I. Nevertheless, we feel that studies 1 and 2 integrate well, as they offer a comprehensive view of 1/f-like EEG activity and its sensitivity to (1) specific anaesthesia effects, (2) attentional focus, and (3) aperiodic stimulus features in a behaviourally relevant way. While (1) and (2) can be mapped onto one underlying mechanism, cortical E:I balance, (3) rather represents bottom-up sensory cortical effects similar to those described in the SSEP or speech-tracking literature. The interaction of attentional focus and stimulus tracking illustrates the connection between top-down (or anaesthesia-driven) changes in E:I, as captured by the EEG spectral exponent, and bottom-up sensory-related changes in EEG activity.

      Reviewer #3 (Public Review):

      The balance between excitation and inhibition in the cortex is an interesting topic, and it has already been a focus of study for a while. The current manuscript focuses on the 1/f slope of the EEG spectrum as the neural substrate of changes in the balance between excitation and inhibition. While the approach used to analyze the data is interesting, unfortunately, for the reasons I outline below, the study's conclusions are not supported by the data, and the findings do not add any new conceptual or mechanistic insight to our understanding of attention, excitation, or inhibition. While the study aims to "test the conjecture that 1/f-like EEG activity captures changes in the E:I balance of underlying neural populations," ultimately the central conclusions of the work are just conjecture, in that they are inferences formed without sufficient evidence.

      Anaesthesia study: EEG spectral exponents as a non-invasive approximation of E:I balance

      The authors observe that the 1/f slope over pre-selected central electrode sites differed between 4 participants undergoing ketamine and propofol anaesthesia. The rather small sample size is a cause for concern, as is the authors' rationale for looking at the central electrodes: they claim these electrodes receive contributions from many cortical and subcortical sources, but that can be said of any other electrodes on the scalp. But I believe the most critical weakness here is the authors' claim that, during anaesthesia, propofol is "known" to result in a "net" increase of inhibition, while ketamine results in an increase in net excitation. We still know very little about what is happening neurophysiologically under anaesthesia, and the concept of "net" inhibition and excitation is rather a gross simplification of what happens to the central nervous system under these two agents. Just as an example, propofol has been found to have some excitatory influence on brain function, with the dosage of the anaesthetic also playing a role: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2717965/. On the other hand, ketamine has been observed to inhibit interneurons and cortical stimulus-locked responses, but to cause excitation in the auditory cortex: https://physoc.onlinelibrary.wiley.com/doi/10.1113/JP279705.

      Suffice it to say, the interaction between anaesthetic agents and the brain is rather complex. Decades of research have shown that the EEG spectrum changes during anaesthesia. To rather arbitrarily say that one agent has a net inhibitory impact and another a net excitatory impact, then link those to qualitative changes in the EEG spectra of 4 participants, and further link that back to the E:I ratio, is committing the scientific fallacy of Begging the Claim.

      We thank the reviewer for their insightful comments. Of course, we do not wish to challenge the complex nature of anaesthetic effects by any means and apologize if the original version of our manuscript had left that impression. Below, we outline that despite the complex impact of anaesthesia on central nervous activity, there exists plenty of evidence justifying our assumption of differentially altered E:I balance through propofol and ketamine, at least in cortical areas.

      First of all, we agree with the reviewer that a change in E:I balance is certainly not the only change that takes place in the central nervous system during anaesthesia. As has been shown before, propofol and ketamine affect the overall level of neural activity (Taub et al., 2013) and spiking (Quirk et al., 2009; Kajiwara et al., 2020), and propofol is associated with frontal alpha oscillations and widespread changes in beta power (Purdon et al., 2012). In the updated version of the manuscript, we have added references to these common patterns and discuss the oscillatory changes we observe in the current dataset.

      Importantly, while there might not be a single identifiable mechanism behind the host of different anaesthesia-induced changes in brain activity, there is relatively clear evidence that higher doses of propofol shift the balance of excitatory and inhibitory activity towards inhibition, whereas ketamine drives disinhibition and hence shifts E:I towards excitation. In fact, the study by Deane et al. (2020) reports increased excitation and disinhibition in auditory cortex during ketamine anaesthesia, accompanied by stronger (not weaker, as stated by the reviewer) evoked responses. These findings support the validity of simplifying ketamine's effect to a net increase of excitation. Furthermore, the modelling results by McCarthy et al. (2008) concern a dose- and cell-ensemble-specific effect of propofol anaesthesia: paradoxical excitation. The observation that low doses of propofol can induce a temporary increase of excitatory activity stands in stark contrast to the general GABA-A-potentiating and hence inhibitory nature of propofol (Concas et al., 1991). Importantly, however, higher doses of propofol, as used in the analysed dataset, are widely accepted to lead to relatively increased inhibition, even after initial paradoxical excitation (Concas et al., 1991; Zhang et al., 2009; Brown et al., 2011; Ching et al., 2010). Taken together, previous invasive physiology justifies the simplification of propofol as leading to net increased inhibition and ketamine as leading to net excitation. Finally, our focus on the spectral exponent does not stem from a disregard of oscillatory changes in EEG activity but rather strictly follows from previous work that demonstrated the spectral exponent to be a marker of E:I balance (Gao et al., 2017; Colombo et al., 2020; Lendner et al., 2021; Chini et al., 2021). Hence, the central goal of the presented analyses and results lies in transferring these previous results to non-invasive EEG recordings and to the parameterization approach we used. We hope that this is clearer in the updated version of the manuscript; relevant parts are pasted below.

      “Both anaesthetics exert widespread effects on the overall level of neural activity (Taub et al., 2013) as well as on oscillatory activity in the range of alpha and beta (8–12 Hz; ~15–30 Hz). Importantly, however, propofol is known to commonly result in a net increase of inhibition (Concas et al., 1991; Franks, 2008) whereas ketamine results in a relative increase of excitation (Deane et al., 2020; Miller et al., 2016). In accordance with invasive work and single cell modelling (Chini et al., 2021; Gao et al., 2017), propofol anaesthesia should thus lead to an increase in the spectral exponent (steepening of the spectrum) and ketamine anaesthesia to a decrease (flattening). Based on previous results, the effect of anaesthesia on EEG spectral exponents is expected to be highly consistent and display little topographical variation (Lendner et al., 2020). For simplicity, we focused on a set of 5 central electrodes that receive contributions from many cortical and subcortical sources (see Fig 1) but report topographically-resolved effects in the supplements (see Fig 1 supplement 1). Here, propofol anaesthesia led to an overall increase in EEG power which was especially pronounced in the alpha-beta range. Ketamine anaesthesia decreased the frequency of alpha oscillations and suppressed power in the beta range. Importantly, however, EEG spectral exponents that were estimated while accounting for changes in oscillatory activity increased under propofol and decreased under ketamine anaesthesia in all participants (both p_permuted < .0009, Fig 1). These results replicate previous invasive findings and support the validity of EEG spectral exponents as markers of overall E:I balance in humans.”
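      The spectral exponent discussed above is the negative slope of the power spectrum in log-log coordinates, so a steeper spectrum gives a larger exponent. The following is a minimal sketch of that idea, assuming a plain least-squares line fit with oscillatory bands masked out. It is a simplification, not the parameterization approach used in the manuscript, and the function name, its arguments and the synthetic spectra are illustrative only:

```python
import numpy as np


def spectral_exponent(freqs, power, fit_range, exclude=()):
    """Estimate an aperiodic (1/f-like) exponent from a power spectrum.

    Illustrative sketch: fits log10(power) = b - exponent * log10(freqs) by
    least squares over fit_range (inclusive, in Hz), skipping any (lo, hi)
    bands listed in `exclude`, e.g. bands containing oscillatory peaks.
    A larger exponent means a steeper spectrum.
    """
    freqs = np.asarray(freqs, dtype=float)
    power = np.asarray(power, dtype=float)
    mask = (freqs >= fit_range[0]) & (freqs <= fit_range[1])
    for lo, hi in exclude:
        mask &= ~((freqs >= lo) & (freqs <= hi))  # drop masked bands
    if mask.sum() < 2:
        raise ValueError("need at least two frequency bins to fit a line")
    slope, _intercept = np.polyfit(np.log10(freqs[mask]), np.log10(power[mask]), 1)
    return -slope


if __name__ == "__main__":
    f = np.arange(1.0, 41.0)                    # 1..40 Hz, synthetic
    aperiodic = 5.0 * f ** -2.0                 # pure 1/f^2 background
    with_alpha = np.where((f >= 8) & (f <= 12), aperiodic * 10.0, aperiodic)
    print(spectral_exponent(f, aperiodic, (1, 40)))
    print(spectral_exponent(f, with_alpha, (1, 40), exclude=[(8, 12)]))
```

      Masking the 8–12 Hz band keeps the synthetic alpha peak from biasing the slope. Dedicated spectral-parameterization tools instead model the peaks and the aperiodic component together, which avoids choosing exclusion bands by hand.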

      “[…] While the EEG spectral exponent as a remote, summary measure of brain electric activity obviously cannot quantify local E:I in a given neural population, the non-invasive approximation demonstrated here enables inferences on global neural processes previously only accessible in animals and using invasive methods. Future studies should use a larger sample to directly compare dose-response relationships between GABA-A agonists or antagonists (e.g., flumazenil) and the EEG spectral exponent as well as common oscillatory changes.”

      Regarding the reviewer’s comment on our choice of electrodes, we first wish to highlight that several previous studies have revealed that anaesthesia effects commonly appear throughout the cortex in humans (Zhang et al., 2009; Lendner et al., 2020). Nevertheless, we understand that a priori choices of electrodes are always arbitrary to some degree. Hence, we performed pairwise comparisons of EEG spectral exponents between awake rest and anaesthesia (ketamine vs. propofol) at all 60 electrodes, resulting in the topographies of t-values shown below. As can be discerned from these topographies, ketamine anaesthesia entailed a reduction of spectral exponents across most areas of the scalp, peaking at frontal and central sites. Propofol led to increased EEG spectral exponents across all electrodes without a clear spatial pattern. The absence of an effect at the left mastoid likely traces back to artefactual recordings at that electrode site. In the updated version of the manuscript, we report topographies of these comparisons in the supplements (figure 1 supplement 2).

      Figure R8. Topographically resolved t statistics comparing EEG spectral exponents between awake rest and different anaesthetics. Propofol leads to a widespread increase in spectral exponents that is present across the entire scalp (left). Ketamine leads to a reduction in spectral exponents that is widely distributed but appears to peak at frontal and central electrodes (right).

      We acknowledge the small sample size of study 1 and have added a more explicit note on it in the updated version of our manuscript. Nevertheless, given their consistency and the permutation-based statistics used, which are appropriate for small sample sizes, the results of study 1 can be interpreted. Furthermore, we realized that we had not included two additional participants of the publicly available dataset in our previous analysis. Both sets of recordings (ketamine / propofol) were included in the revised analyses of the data, further strengthening the reported results. Hence, despite the small sample size (now N = 5 per group), we believe that the methods used and the consistency of effects allow for a careful but clear interpretation, especially since they are in close agreement with previous invasive and modelling results as well as recent causal manipulation studies (Gao et al., 2017; Chini et al., 2021).
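      Permutation tests suit small samples because they build the null distribution from the data themselves rather than from distributional assumptions. The exact scheme used in study 1 is not described in this response, so the following is only an illustration of one common option for paired data: an exact sign-flip test on per-participant differences (for example, exponent under anaesthesia minus exponent at rest). The function name and example values are hypothetical:

```python
import itertools

import numpy as np


def paired_sign_flip_test(condition_a, condition_b):
    """Exact one-sided sign-flip permutation test for paired data (illustrative).

    H1: mean(condition_b - condition_a) > 0. Under H0 each participant's
    difference is equally likely to carry either sign, so all 2**n sign
    patterns are enumerated (feasible only for small n).
    Returns (observed mean difference, p-value).
    """
    diffs = np.asarray(condition_b, dtype=float) - np.asarray(condition_a, dtype=float)
    observed = diffs.mean()
    n_extreme = 0
    for signs in itertools.product((1.0, -1.0), repeat=diffs.size):
        # Count sign patterns whose mean is at least as large as the observed one.
        if (np.asarray(signs) * diffs).mean() >= observed:
            n_extreme += 1
    return observed, n_extreme / 2 ** diffs.size


if __name__ == "__main__":
    rest = [1.0, 1.0, 1.0, 1.0, 1.0]             # hypothetical exponents
    anaesthesia = [2.0, 3.0, 2.5, 2.2, 2.8]      # hypothetical exponents
    print(paired_sign_flip_test(rest, anaesthesia))
```

      Enumerating every sign pattern gives an exact p-value. For larger samples, a random subset of sign patterns is the usual substitute.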

      Cross-modal study: EEG spectral exponents track modality-specific, attention-induced changes in E:I

      Here the authors observe a difference in 1/f slope depending on whether the participants (n=24) were paying attention to the auditory or the visual stream. My central issue here is again with the authors' assumption that cross-modal attention effects reflect attention-induced changes in E/I. While attention to a single sensory modality can result in decreased activity in cortical regions that process information from an unattended sensory modality, there is no basis here to say that the task-irrelevant region is actually inhibited. The authors do observe differences in 1/f slope as a function of attentional location, and these differences do account for some of the variance in behavior in the task.

      Unfortunately, however, beyond a purely descriptive exercise, no mechanistic insight is revealed here with regard to attentional allocation, excitation, and inhibition.

      We wish to take this opportunity to briefly elaborate on the hypotheses behind the reported attention contrasts and their interpretation. Spectral exponents of invasively recorded neural field potentials have previously been shown to reflect pronounced changes in E:I balance, including recent causal optogenetic work explicitly testing this link (Gao et al., 2017; Chini, Pfeffer & Hanganu-Opatz, 2021). In a first step, we analysed data from different anaesthetics to establish the capacity of non-invasive EEG recordings to track similar changes (see above). Building on these findings, we tested whether smaller, attention-related and topographically specific changes in E:I balance can equally be observed by means of EEG spectral exponent changes. Importantly, topographically specific changes in E:I with attention have been reported previously in non-human animals (e.g., Kanashiro et al., 2017; Ni et al., 2018). We found an attention-related topographical pattern of EEG spectral exponents in support of this idea: spectral exponents at occipital channels decreased during visual attention, pointing towards a relative increase of excitatory activity in visual cortical areas. The same effect was reduced at central electrodes and for auditory attention. These findings demonstrate the capacity of EEG spectral exponents to detect topographically specific, attention-related changes in brain activity that likely trace back to changes in E:I balance. Of note, we do not imply a role of E:I in the inhibition of unattended sensory input and activity in associated cortical areas, but rather point to a potentially separate role of neural alpha power in this context. While it is generally difficult to draw strictly mechanistic insights from correlational designs, our results nonetheless strongly suggest a mechanistic role of modality-specific attention for EEG dynamics and E:I balance. Furthermore, by demonstrating separate effects of aperiodic activity and alpha power dynamics, we pave the way for a new line of studies (see comments by R1) on the neural dynamics of selective attention and their behavioural relevance in humans.

    1. Author Response

      Reviewer #1 (Public Review):

      The authors have used many cleverly chosen mouse models (periodontitis models; various models that lead to an on-switch of genes) and methods (high-quality immunolocalizations; single-cell RNA sequencing) in the quest to elucidate a role for telocytes. They describe that more telocytes are present around teeth in mice with periodontitis. These cells proliferated, and they expressed a pattern of genes that allowed macrophages to differentiate in a different direction. In particular, they showed that telocytes in periodontitis express HGF, a molecule that steers macrophage differentiation towards a less inflammatory cell type, paving the way for recovery. As a weakness, one could state that an attempt to extrapolate to human cells is missing.

      In the Discussion, we have a sentence that states further investigation in human periodontitis is required (see page 20, paragraph 416).

      Reviewer #3 (Public Review):

      Zhao and Sharpe identified telocytes in the periodontium. To address their contribution to periodontal diseases, they conducted scRNA-seq analysis and lineage tracing in mice. They demonstrated that telocytes are activated in periodontitis. The activated telocytes send HGF signals to surrounding macrophages, converting them from M2 to an M1/M2 hybrid status. The study implies that targeting telocytes and HGF signalling could be a potential treatment for periodontitis.

      The significance of the study could be improved by authors testing if targeting telocytes or HGF signals could ameliorate periodontitis in the mouse model. The current form of the manuscript lacks the data that demonstrate the actual contribution of telocytes in the homeostasis of periodontium or progression of periodontitis.

      Major comments:

      1) I consider genetic validation of the role of telocytes or HGF signals crucial to establishing the significance of this manuscript. I recommend either of two experiments. a. Testing the role of HGF signals by deleting the Hgf gene in telocytes. Using Wnt11-Cre; Hgf f/f mice, the authors could address the role of HGF signals in periodontitis. CX3CR1-Cre; cMet f/f mice would delete HGF signalling in monocyte-derived macrophages. This would be another verification, but it is not clear whether the PDL macrophages are derived from the yolk sac or from monocytes. b. Measuring the contribution of telocytes to homeostasis or disease progression. Although the mouse model could be challenging, the system, if achieved, would be very informative. The authors could first check the expression of telocyte-enriched genes, such as Lgr5 or Foxl1, reported previously in telocytes of other tissues. Delete those genes under the Wnt1-Cre driver and check whether the telocyte lineage is removed. The system would be very useful for next-level studies. A DTA model could be an alternative, but Wnt1-Cre is widely expressed in the neural crest lineage.

      These are good suggestions but unfortunately not feasible, as we do not have all the mouse lines (e.g., Hgf f/f mice). Lgr5 and Foxl1 are used in the intestine but are not suitable for PDL tissue. The CD34;DTA model targets CD34+ cells; however, we encountered challenges associated with induced genetic heterogeneity when using this model, which prevented us from drawing firm conclusions from the experiments using the CD34;DTA model. Lgr5/Foxl1 are either not expressed or overlap with CD34 in PDL tissue and therefore do not seem suitable for us to pursue.

      2) This paper points out that the M1/M2 hybrid state of macrophages appears upon periodontitis. The authors could further characterize the hybrid macrophages by the expression of more markers, production of cytokines, and morphology. Please clarify whether this means that some macrophages are in the M1 state and others in the M2 state, or that a single macrophage possesses both M1 and M2 phenotypes. Please conduct either FACS or immunofluorescence to demonstrate whether one macrophage expresses both markers. Please provide more information about the M1/M2 hybrid state of macrophages based on the existing literature.

      Unlike with our single-cell sequencing data, we were unable to determine by immunolabelling whether a single macrophage possesses both M1 and M2 phenotypes.

      3) In the introduction, the authors list several markers that can be used for telocyte identification, such as CD34+CD31-, CD34+c-Kit+, CD34+Vim+, and CD34+PDGFRα+. Could the authors explain why they chose CD34+CD31- rather than the other markers?

      As shown in the cluster images below, the other markers do not overlap well with CD34+ cells or, in the case of Vim, are expressed more ubiquitously. We generated a new supplementary figure (Supp Fig2) and explained this in the text (page 12, lines 235-238).

      4) In figure 5g, I do not think the yellow cells show a reducing trend in the Tivantinib treatment group compared with the control group. Please validate the observation by gene expression analysis, WB, etc. In addition, please show the level of c-Met+ cells in the Tivantinib treatment group and the control group.

      New Supp Fig4 is included to show Met expression in homeostasis and periodontitis.

    1. Author Response

      Reviewer #1 (Public Review):

      The tools and approaches in this manuscript are of broad interest, not only to protein engineers but also to the many researchers using genome-editing reagents. However, putting the work in the context of previous research, both through changing the writing and additional experiments, will be critical for taking advantage of that widespread applicability.

      Strengths:

      Overall, the data support the conclusions of the manuscript.

      The most exciting product of this work is an engineered nuclease, Nsp2-SmuCas9, that has high activity and specificity in human cells and a relaxed PAM preference for a single C base. This chimeric enzyme can efficiently induce indels at endogenous sites. While other works have presented nucleases with minimal PAM preferences, Nsp2-SmuCas9 is a useful alternative and may be preferred. It is also more compact than the standard SpCas9, making it appealing for gene therapy applications.

      Technologically, the presented approach of screening orthologs for new specificities and making chimeras to achieve further diversity is a good way to develop new genome-editing reagents. The authors used appropriate methods, such as GUIDE-seq, to complete their goals. Extending beyond the GFP-activation assay to determine activity at endogenous targets enhanced the value of the results.

      Conceptually, it is important information for the field that proteins with very high sequence identity (93%) can have divergent PAM preferences. Through their engineering, the authors clearly demonstrate the advantage of characterizing such close orthologs with diverse amino acids in the PAM-recognition region.

      Weaknesses:

      1) An overall weakness of the work is that it is not clear how the activity level of the relaxed-PAM enzyme, Nsp2-SmuCas9, compares to existing enzymes. Is it much better than the SpCas9 variants with almost no PAM preference (SpRY) or with an NGN PAM (SpG)? How does it compare to the most commonly used SpCas9 nuclease, which is known to be active in a wide variety of biological contexts? The activity assessment at endogenous sites seemed to use a long timeline, as the indel rate was measured 5 days after transfection. Clarifying the effectiveness of this new nuclease would increase the impact of this work.

      We sincerely thank the reviewer for the constructive comments on our manuscript. Following the reviewer’s suggestions, we compared the editing efficiency of Nsp2Cas9, Nsp2-SmuCas9, SpCas9, SpCas9-NG, and SpCas9-RY side by side. Overall editing efficiency was low in this experiment, probably because of low transfection efficiency. The results revealed that SpCas9 was the most active enzyme. Nsp2Cas9, SpCas9-NG, and SpCas9-RY displayed similar activity. Nsp2-SmuCas9 displayed lower activity than the other Cas9 variants (Figure 5C).

      2) In the presentation of the manuscript, there are several weaknesses. First, while it is true that allele-specific disruption is an important application of new CRISPR proteins, there are many other reasons why they would be useful. The specific focus on this single application throughout the abstract, introduction and discussion takes away from the widespread utility of these new tools. The writing would be more compelling if it targeted a broader audience. Allele-specific targeting is also possible beyond the PAM site if the mutation is in a position with high specificity.

      Many thanks for the reviewer’s suggestions. Following these suggestions, we now emphasize the widespread utility of these new tools throughout the abstract, introduction, and discussion of the revised manuscript. Allele-specific targeting is only mentioned in the discussion.

      3) Second, the introduction is further missing a discussion of other research engineering new PAM specificity or even completely removing specificity. A more convincing narrative would include reasoning for why characterizing naturally occurring orthologs is a powerful and important approach. This information is in the discussion, but it would be helpful for the reader if these points were in the introduction.

      Many thanks for the reviewer’s comments. Following the reviewer’s suggestions, we added a discussion of other research engineering new PAM specificities to the introduction. We also included reasoning for why characterizing naturally occurring orthologs is a powerful and important approach.

      “Engineered Cas9 variants with flexible PAMs can increase targeting scope. For example, SaCas9 was engineered to accept an NNNRRT PAM [1], and SpCas9 was engineered to accept almost all PAMs [2], but this strategy is time-consuming and often comes at the cost of reduced on-target activity. Another strategy is to harness natural Cas9 nucleases for genome editing. We have developed several closely related Cas9 orthologs for genome editing [3, 4]. The advantage of developing tools from closely related Cas9 orthologs is that they can exchange the PAM-interacting (PI) domain. If an ortholog recognizes a particular PAM but does not work efficiently in human cells, we can use its PI domain to replace the PI domain of another ortholog to generate a chimeric Cas9.”
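      To make "targeting scope" concrete: the less constrained a PAM is, the more positions in a given sequence can serve as target sites. The sketch below counts PAM matches written in IUPAC notation on one strand of a hypothetical sequence. NGG (the canonical SpCas9 PAM), NNNRRT, and a single-C pattern are used for comparison; the single-C pattern is a simplified stand-in for the relaxed PAM of Nsp2-SmuCas9, whose full PAM is not given in this section. The helper name and the sequence are illustrative only:

```python
import re

# IUPAC nucleotide codes needed for the example PAMs (not the full IUPAC table).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}


def pam_sites(sequence, pam):
    """Return 0-based start positions of every (possibly overlapping) PAM match.

    Illustrative sketch: only the given strand is scanned; a real target search
    would also scan the reverse complement and check the protospacer 5' of the PAM.
    """
    pattern = "".join(IUPAC[base] for base in pam.upper())
    # A zero-width lookahead lets overlapping matches all be reported.
    return [m.start() for m in re.finditer(f"(?={pattern})", sequence.upper())]


if __name__ == "__main__":
    seq = "ATGGCAGGTACCAGAATGG"  # hypothetical sequence
    for pam in ("NGG", "NNNRRT", "C"):
        print(pam, len(pam_sites(seq, pam)))
```

      A single-base PAM matches far more often than a multi-base one, which is why relaxed-PAM nucleases expand the set of addressable sites.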

      4) A second concern with the presentation and analysis of the findings is a minimal connection to the structural context of the discoveries. Many readers will likely be interested in how the specificity shifts are occurring in these orthologs, which could be remedied by supplementary figures of homology models.

      We totally agree with the referee that structural models would help readers better understand the specificity shifts occurring in these orthologs. We have generated structural models of these orthologs in complex with sgRNA and DNA using the crystal structure of Nme1Cas9 (PDB ID: 6JDV). Some specificity shifts can be well explained by these structural models. When the amino acid near position 5 of the PAM is histidine, its side chain can form a hydrogen bond with the O6 carbonyl oxygen of guanine. Replacing this guanine with cytosine or thymine would cause a major clash, whereas adenine lacks the O6 acceptor needed to form a hydrogen bond with the histidine (Figure 2-figure supplement 2A). Likewise, an aspartate at position 5 of the PAM would favor specific recognition of cytosine via hydrogen bonding with its 4-amino group, but not of other bases, which would either clash or abolish the hydrogen bond (Figure 2-figure supplement 2B). A similar explanation applies to the apparent specificity between glutamine and adenine at position 8 of the PAM on the target sequence (Figure 2-figure supplement 2C).

      5) Along the same lines, further structural analysis of the failures would be helpful for those embarking on similar projects. Are there any differences in the sequence or structure of the 4/29 orthologs that were not functional in the GFP-activation assay compared to those that were?

      Sequence alignment indicates that the four inactive orthologs possess intact active sites. In the predicted structural models of these orthologs, we did not observe local conformational variations that preclude the interaction with sgRNA or DNA. We speculate that specific modifications of Cas9s in mammalian cells may occur, leading to the loss of enzymatic activities of the 4 orthologs.

      Calculated structural models of AseCas9, Hpa1Cas9, MspCas9, and PlaCas9. Overall calculated structures of AseCas9, Hpa1Cas9, MspCas9, and PlaCas9 with sgRNA and dsDNA.

      6) Similarly, it was surprising that the Nsp2-NarCas9 chimera was not active, and it would be helpful if the authors could speculate based on the differences between SmuCas9 and NarCas9, such as at the interface of the domains that were fused. Structural models of the fusions would help the reader to visualize the strategy. Exploring the failures and challenges is important for understanding the generalizability of the presented approach.

      Following the reviewer’s comments, we generated structural models of Nsp2-NarCas9, Nsp2-SmuCas9, and NarCas9 with SWISS-MODEL, using the crystal structure of the highly homologous Nme1Cas9 in complex with sgRNA and dsDNA (PDB ID: 6JDV) as the template. By superimposing these models, we noticed that residues G1035, K1037 and T1038 of the Nsp2-NarCas9 chimera protrude towards the DNA molecule, which would prevent DNA binding and thereby abolish editing activity (Figure 4-figure supplement 2A). In comparison, Nsp2-SmuCas9 and NarCas9, which are active, show no protrusion at the corresponding position (Figure 4-figure supplement 2B-C).

      7) Finally, the sequence of the Nsp2-SmuCas9 fusion, as well as those of other enzymes such as the failed Nsp2-NarCas9, are not obvious in the manuscript. I may have missed them, but I also did not see the primers used in the Methods section. Addgene submission is also encouraged and would be of great value to the scientific community.

      Thank you for your suggestions. The final sequences of Nsp2-SmuCas9 and the other enzymes are provided in Supplemental file 1. The primers for the chimeric proteins are also listed in Supplemental file 1. We will submit the plasmids to Addgene soon.

    1. Author Response

      Reviewer #1 (Public Review):

      In this manuscript, Li et al characterize sex differences in the impact of macrophage RELMa in protection against diet-induced obesity [DIO]. This is a key area of interest as obesity studies in mice have generally focused exclusively on male animals, as they tend to gain more weight, faster than female mice. The authors use a combination of flow cytometry, adoptive transfer, and single-cell transcriptomics to characterize the mechanism of action for female-specific DIO protection. They identify a potential role for eosinophils in mediating female DIO protection downstream of RELMa production by macrophage. They also use the transcriptomic characterization of the stromal vascular fraction of the adipose tissue to evaluate molecular and cellular drivers of this sex-specific DIO protection.

      Although the authors provide solid evidence for many claims in the manuscript, there is generally not enough information about the studies' methods (especially on the computational/data analysis aspects) for a careful evaluation of the results' robustness at this stage.

      We have significantly expanded the methodology, especially of the scRNAseq, and deposited the script and raw data in public repositories. We also validated our methods and can confirm that the analysis presented is robust. This resubmission contains new Fig 7 and new supplementary material with this methodology and validation.

      Reviewer #2 (Public Review):

      In the study by Li et al., the authors hypothesize that RELMa, a macrophage-derived protein, plays a sex-dimorphic role as a protective factor in obesity in females vs males. The authors perform largely in vivo studies utilizing male and female WT and RELMa KO mice on a high-fat diet and perform an in-depth analysis of immune cell composition, gene expression, and single-cell RNA sequencing. The authors find that WT females are protected from obesity and inflammation vs males, and this protection is lost in female RELMa KO mice. Further analysis by the authors, including flow cytometry of the visceral fat SVF, showed that female WT mice have reduced macrophage infiltration, higher levels of eosinophils, and higher Th2 cytokine expression compared to WT male mice and female KO mice. The authors show that the loss of protection from obesity and inflammation in female RELMa KO mice can be rescued by injection of eosinophils or recombinant RELMa. Lastly, the authors use single-cell RNA sequencing to further analyze SVF cells in WT and KO male and female mice on a high-fat diet.

      Overall, we find that the study represents an important finding in the immunometabolism field, showing that RELMa is a key myeloid-derived factor that helps influence macrophage-eosinophil function in female mice and protects from diet-induced obesity and inflammation in a sexually dimorphic manner. The study provides strong and convincing data supporting the authors' hypothesis and conclusion.

      We thank the reviewer for their positive review of our manuscript and their helpful feedback which we address below.

      Reviewer #3 (Public Review):

      Li, Ruggiero-Ruff et al. examine the role of RELMα, an anti-inflammatory macrophage signature gene, in mediating sex differences in high-fat diet (HFD)-induced obesity in young mice. Specifically, the authors hypothesize that RELMα protects females against HFD-induced obesity. Comparisons between RELMα-knockout (KO) and wildtype (WT) mice of both sexes revealed sex- and RELMα-specific differences in weight gain, immune cell populations, and inflammatory signaling in response to HFD. RELMα-deficiency in females led to increased weight gain, expansion of pro-inflammatory macrophage populations, and eosinophil loss in response to HFD. Female RELMα-deficiency could be rescued by RELMα treatment or eosinophil transfer. Single-cell RNA-sequencing (scRNA-seq) of adipose stromal vascular fraction (SVF) revealed sex- and RELMα-dependent differences under HFD conditions and identified potential "pro-obesity" and "anti-obesity" genes in a cell-type-specific manner. Using trajectory analysis, the authors suggest dysregulation of the monocyte-to-macrophage transition in RELMα-deficient mice.

      The conclusions of this paper are mostly well supported by the data, but some aspects of the statistical and single-cell analyses will need to be corrected, clarified, and extended to enhance the report.

      We thank Dr. Ocanas for their positive comments and for the helpful feedback to improve our study. We have addressed all the comments and significantly revised the manuscript.

      Strengths:

      The authors use several orthogonal approaches (i.e., flow cytometry, immunohistochemistry, scRNA-Seq) and models to support their hypotheses.

      The authors demonstrate that phenotypes observed in HFD-fed females with RELMα-deficiency (i.e., weight gain, loss of eosinophils, a gain of M1 macrophages) can be rescued by RELMα treatment or eosinophil transfer.

      The authors recognized the complexity of macrophage activation that is beyond the 'M1/M2' paradigm and informed readers in the introduction as to why this paradigm was used in this study. During the scRNA-seq analyses, the authors further sub-cluster macrophages to include more granularity.

      Weaknesses:

      1) There are several instances in the text where the authors claim that there is a significant difference between the two groups, but the statistics for these comparisons are not shown in the figure.

      Because we are dealing with three variables (genotype, diet, and sex) and many pairwise differences, we considered it too cluttered to mark every significant difference on the graphs. Instead, we sometimes reported these differences in the text with a p value, or did not mention them when the difference was obvious or not meaningful for our hypothesis (for example, we were not interested in comparing a WT male on a control diet with a RELMalpha KO female on HFD). We have now ensured clarity in the text and in the figures, and addressed the reviewer's specific point-by-point comments. We have also carefully re-evaluated the text to ensure that every significant difference we discuss is shown in the figure.

      2) It is unfortunate that eosinophils could not be identified in the single-cell analysis since this population of cells was shown to be important in rescuing the RELMα-deficiency in HFD-fed females. The authors should note in the discussion how future scRNA-Seq experiments could overcome this limitation (i.e., enriching immune cells prior to scRNA-Seq).

      We were indeed disappointed that we could not obtain single-cell sequencing data for eosinophils, but this is a reported issue in the field. We have expanded our discussion of this point and cited a paper that performed eosinophil single-cell sequencing (published while our manuscript was being submitted): “While our analysis was ongoing, the first eosinophil single-cell RNA-seq study was published; it used a flow cytometry-based approach rather than 10x, included RNase inhibitor in the sorting buffer, and enriched eosinophils beforehand (PMID: 36509106). Based on guidance from 10x, we used targeted approaches to identify eosinophil clusters by eosinophil markers (e.g. Siglecf, Prg2, Ccr3, Il5r), and relaxed the scRNA-Seq cutoffs to include more cells and intronic content, but still could not find eosinophils. We conclude that eosinophils may be absent because of the enzymatic digestion required for SVF isolation and processing for single-cell sequencing, which could cause selective loss of eosinophils owing to their low RNA content, RNases, or cell viability issues. Future experiments, guided by this recent publication, will be needed to optimize eosinophil single-cell sequencing.”

      3a) There are several issues with the scRNA-Seq analysis and interpretation. More details on the steps taken in the single-cell analyses should be included in the methods section.

      We agree with the reviewer that more details on the steps taken in single-cell data processing and bioinformatics need to be included in the methods section. We have added this information, organized into separate subsections within the data processing section of the Materials and Methods, and have provided the code for our data processing in a public GitHub repository: https://github.com/rrugg002/Sexual-dimorphism-in-obesity-is-governed-by-RELM-regulation-of-adipose-macrophages-and-eosinophils.

      b) With regards to the 'pseudobulk' analyses presented in Figs. 5-6, several of the differentially expressed genes identified in Fig. 6 are hemoglobin genes (i.e., Hba, Hbb genes). It is not uncommon to filter these genes out of single-cell analysis since their presence usually indicates red blood cell (RBC) contamination (PMID: 31942070, PMID: 35672358). We would recommend assessing RBC contamination as well as removing Fig. 6 from the manuscript and focusing on cell-type-specific analyses. Re-analysis will likely have an impact on the overall conclusions of the study.

      Prior to our first submission, we consulted with 10x support scientists and the UCR bioinformatics core director to ensure that our analysis included the appropriate filtering, and we have now added these details to the Methods. The PMIDs provided above are from studies of hippocampal development (where the animals were not perfused, so blood contamination is possible) or of whole blood (where red blood cell contamination would be substantial). In contrast, we perfused our mice and treated the single-cell suspension with RBC lysis buffer, as detailed in the Methods. We have also extended our scSeq analysis to compare hemoglobin RNA with red blood cell-specific markers, including Gypa/CD235a. While hemoglobin transcripts are distributed throughout the myeloid population in female KO mice, Gypa/CD235a, whose expression would indicate RBC contamination, is not expressed at all (see new Fig 7B). Additionally, we provide hemoglobin protein ELISA and IF staining supporting our finding that macrophages from KO mice express hemoglobin protein. Finally, two publications support hemoglobin expression by nonerythroid cells, including macrophages (PMID: 10359765; PMID: 25431740). Although these data make RBC contamination unlikely, we cannot exclude the possibility that macrophages phagocytose RBCs and selectively retain hemoglobin RNA and protein; we discuss this possibility in the text. Based on this justification and the new data, we are confident that our findings and overall conclusions are robust.

      To assess potential RBC contamination beyond Gypa, we also examined the top genes expressed by murine erythrocytes (PMID: 24637361). The feature plots below show little to no expression of these genes, and a very different distribution from that of the hemoglobin genes (see new Fig 7A):

      We also identified a small cluster of potential RBCs (only 75 cells) and filtered it out before the downstream DEG analysis; the results were the same as in the first submission.
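      The marker comparison described above can be sketched as a per-cluster detection rate: the fraction of cells in each cluster with any counts for hemoglobin genes versus an erythrocyte marker such as Gypa. This is a minimal illustration assuming a plain list of per-cell count dictionaries and hypothetical cluster labels and values; it is not the authors' pipeline, which is in their GitHub repository.

```python
from collections import defaultdict

def detection_rate_by_cluster(counts, clusters, genes):
    """Fraction of cells per cluster with a nonzero count for any gene in `genes`.

    counts:   list of dicts mapping gene name -> UMI count, one dict per cell
    clusters: list of cluster labels, aligned with `counts`
    genes:    gene names to test, e.g. hemoglobin genes or an RBC marker
    """
    total = defaultdict(int)
    detected = defaultdict(int)
    for cell, label in zip(counts, clusters):
        total[label] += 1
        if any(cell.get(g, 0) > 0 for g in genes):
            detected[label] += 1
    return {label: detected[label] / total[label] for label in total}

# Toy data (hypothetical): hemoglobin detected in macrophages without Gypa,
# which is the pattern expected if the signal does not come from intact RBCs.
cells = [
    {"Hba-a1": 3, "Hbb-bs": 5},
    {"Hbb-bs": 2},
    {"Lyz2": 10},
    {"Hba-a1": 40, "Hbb-bs": 60, "Gypa": 4},
]
labels = ["macrophage", "macrophage", "macrophage", "putative_RBC"]

hb = detection_rate_by_cluster(cells, labels, ["Hba-a1", "Hbb-bs"])
gypa = detection_rate_by_cluster(cells, labels, ["Gypa"])
print(hb)    # macrophage: 2 of 3 cells, putative_RBC: all cells
print(gypa)  # macrophage: 0.0, putative_RBC: 1.0
```

      A cluster where hemoglobin is detected but Gypa is not (like the toy "macrophage" cluster) is the pattern the response describes; a cluster positive for both is the kind of small putative-RBC cluster that was filtered out.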

      4) Within the text, there are several instances where the authors claim that a pathway is upregulated based on their Gene Ontology (GO) over-representation analysis (ORA). To come to this conclusion, the authors identify genes that are upregulated in one condition and then perform GO-ORA on these genes. However, the authors do not consider negative regulators, whose upregulation would actually decrease the pathway. Authors should either replace their GO-ORA analysis with one that considers the magnitude and direction of differentially expressed genes and provides an activation z-score (i.e., Ingenuity Pathway Analysis) or replace instances of 'upregulated' or 'downregulated' pathways with 'over-represented' pathways.

      Unfortunately, we did not have access to IPA for this project; we have therefore revised our analysis to report over- and under-represented pathways, as suggested.
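      The distinction the reviewer draws can be made concrete. An over-representation test only asks whether a pathway's genes appear among the differentially expressed genes more often than chance, regardless of the direction in which each gene pushes the pathway. A direction-aware score also weighs whether each change would activate or inhibit it. The sketch below contrasts a one-sided hypergeometric ORA p-value with a simplified score z = (sum of +1/-1 consistency signs) / sqrt(N), similar in form to an activation z-score. All gene counts are invented, and this simplified z is an assumption for illustration, not IPA's implementation.

```python
from math import comb, sqrt

def ora_pvalue(k, n_de, n_pathway, n_universe):
    """One-sided hypergeometric P(X >= k): probability of seeing at least k
    pathway genes among n_de differentially expressed genes drawn from a
    universe of n_universe genes, n_pathway of which belong to the pathway."""
    denom = comb(n_universe, n_de)
    upper = min(n_de, n_pathway)
    return sum(
        comb(n_pathway, i) * comb(n_universe - n_pathway, n_de - i)
        for i in range(k, upper + 1)
    ) / denom

def direction_z(effects):
    """Simplified direction-aware score: +1 for each DE gene whose change is
    consistent with pathway activation, -1 for each consistent with inhibition
    (e.g. an upregulated negative regulator)."""
    return sum(effects) / sqrt(len(effects)) if effects else 0.0

# Toy example: 10 of 100 upregulated genes fall in a 200-gene pathway,
# out of a 10,000-gene universe (about 2 would be expected by chance).
p = ora_pvalue(k=10, n_de=100, n_pathway=200, n_universe=10_000)
# Of those 10 upregulated genes, 4 are activators and 6 are negative regulators.
z = direction_z([+1] * 4 + [-1] * 6)
print(f"ORA p = {p:.2e}, direction z = {z:.2f}")
```

      In this toy case the pathway is strongly over-represented (small p) even though the direction-aware score is negative, which is why describing such a result as an "upregulated" pathway can be misleading.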

      5) For Fig.7A, a representative tSNE plot for each group (WT Female, KO Female, WT Male, KO Male) should be shown to ensure there is proper integration of the clusters across groups. There are some instances where the scRNA-Seq data do not appear to be integrated properly (i.e., Supplemental Figure 2C). The authors should explore integration techniques (i.e., Seurat; PMID: 29608179) to correct for potential batch effects within the analysis.

      We thank the reviewer for the suggestion to ensure proper integration of the clusters across groups. We performed integration using the Cell Ranger aggregation (aggr) pipeline (see the updated Materials and Methods section). In addition, we applied several technical controls to prevent batch effects between samples. For sequencing, we followed the 10x Genomics recommended sequencing depth and run parameters for both gene expression and multiplexing libraries: all 3’ gene expression libraries were sequenced at a depth of 20,000 read pairs per cell, and all cell multiplexing libraries at 5,000 read pairs per cell. All libraries were paired-end and dual-indexed, and were pooled on one NovaSeq flow cell lane at a 4:1 ratio (3’ gene expression : multiplexing), as recommended by 10x Genomics, to maintain nucleotide diversity and prevent batch effects during sequencing. When integrating (aggregating) all gene expression libraries with the Cell Ranger aggr pipeline, we applied sequencing-depth normalization across all samples: Cell Ranger equalizes the average read depth per cell between libraries before merging the libraries and their counts. This is the default setting of the aggr pipeline, and it avoids artifacts that could arise from differences in sequencing depth. We are therefore confident that the changes we observed in gene expression and cell-type populations reflect biological differences rather than technical variability. Below we provide a tSNE plot showing the clustering of all 12 samples after integration:

      We updated the old Fig. 7 (now Fig. 6) to include a representative tSNE plot for each group. We also updated the tSNE plot for Figure 5-figure supplement 2C (previously S2C) to show the overall clustering across all groups. The largest population differences occurred in the fibroblast population, and these differences were largely due to sex. Because we are confident that integration was performed appropriately and that batch effects were controlled for, we believe these sex differences are a biological effect.
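      The depth-normalization idea described above (equalizing the average reads per cell across libraries before merging) can be illustrated by random downsampling: every library keeps only the fraction of reads needed to match the shallowest library's mean reads per cell. This is a minimal sketch with hypothetical library names and read totals; it mimics the concept, not Cell Ranger's actual code.

```python
import random

def subsample_fractions(reads_per_library, cells_per_library):
    """Fraction of reads to keep in each library so that every library
    matches the lowest mean reads-per-cell across libraries."""
    mean_depth = {
        lib: reads_per_library[lib] / cells_per_library[lib]
        for lib in reads_per_library
    }
    target = min(mean_depth.values())
    return {lib: target / depth for lib, depth in mean_depth.items()}

def downsample(read_ids, fraction, seed=0):
    """Keep each read independently with probability `fraction`."""
    rng = random.Random(seed)
    return [r for r in read_ids if rng.random() < fraction]

# Hypothetical libraries: 20,000 vs 25,000 read pairs per cell on average.
reads = {"WT_F_1": 100_000_000, "KO_F_1": 125_000_000}
cells = {"WT_F_1": 5_000, "KO_F_1": 5_000}
fractions = subsample_fractions(reads, cells)
print(fractions)  # {'WT_F_1': 1.0, 'KO_F_1': 0.8}

# Applying the fraction to 10,000 toy read IDs keeps roughly 8,000 of them.
kept = downsample(range(10_000), fractions["KO_F_1"])
print(len(kept))
```

      Downsampling the deeper library, rather than scaling counts, keeps the data as integer counts while removing depth as a source of apparent differences in detected genes per cell.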

      6) LncRNA Gm47283 is identified as a gene that is differentially expressed by genotype in HFD females (Fig. 7G); however, according to Ensembl this gene is encoded on the Y-chromosome (https://uswest.ensembl.org/Mus_musculus/Gene/Summary?g=ENSMUSG00000096768;r=Y:90796007-90827734). The authors should use the RELMα genotype and sex chromosomally-encoded genes to confirm that their multiplexing was appropriate.

      We agree with the reviewer that it is crucial to confirm that multiplexing and all subsequent analyses were performed correctly. The comparison between males and females includes internal controls that increase confidence, such as Xist, which is expressed only in females, and Ddx3y, which is located on the Y chromosome. The lncRNA Gm47283 is located in a syntenic region of the Y chromosome, but it is also present in females, where it is annotated as Gm21887 in the syntenic region of the X chromosome. It also has 100% alignment with Gm55594 on the X chromosome. In addition, it is referred to as erythroid differentiation regulator 1 (Erd1), x or y depending on the chromosome, although the NCBI database specifies partial assembly and incomplete annotation. This explains why we detect expression of this gene in females. We now discuss this in the text and refer to this lncRNA as Gm47283/Gm21887 to prevent further confusion. The RELMalpha genotype (absence of RELMalpha in the KO) was also confirmed. Finally, the principal component (PC) analysis (see Fig 5) supports clustering by group.
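      The internal sex controls mentioned above lend themselves to a simple per-sample check of the multiplexing: a sample labeled female should show Xist and essentially no Ddx3y, and a male sample the reverse. The sketch below assumes pseudobulk counts per sample and an arbitrary detection threshold; the sample names, counts, and threshold are all hypothetical.

```python
def infer_sex(xist, ddx3y, min_count=10):
    """Infer sex from pseudobulk counts of Xist (female) and Ddx3y (Y-linked)."""
    female_signal = xist >= min_count
    male_signal = ddx3y >= min_count
    if female_signal and not male_signal:
        return "F"
    if male_signal and not female_signal:
        return "M"
    return "ambiguous"  # both or neither: possible demultiplexing problem

def check_labels(samples):
    """Return the samples whose inferred sex disagrees with their label.

    samples: dict mapping sample name -> (label, xist_count, ddx3y_count)
    """
    return [
        name for name, (label, xist, ddx3y) in samples.items()
        if infer_sex(xist, ddx3y) != label
    ]

samples = {
    "WT_F_1": ("F", 5400, 0),
    "KO_F_1": ("F", 4900, 2),
    "WT_M_1": ("M", 3, 870),
    "KO_M_1": ("M", 4100, 910),  # deliberately inconsistent toy entry
}
print(check_labels(samples))  # ['KO_M_1']
```

      Flagging "ambiguous" samples separately from clear mismatches is a design choice: co-detection of both markers more likely points to a demultiplexing or doublet problem than to a simple label swap.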

      7) For Fig. 8, samples should be co-clustered and integrated across groups before performing trajectory analysis to allow for direct comparisons between groups.

      We appreciate the valuable feedback and suggestions, which helped us clarify the trajectory analysis, as follows:

      Regarding the co-clustering and integration of our samples across groups, our trajectory analysis approach was as follows. We co-clustered all of our samples using the align_cds function from the Monocle3 package. We have included the code for Figure 8 in our GitHub repository at https://github.com/rrugg002/Sexual-dimorphism-in-obesity-is-governed-by-RELM-regulation-of-adipose-macrophages-and-eosinophils/blob/main/Figure8.R. Specifically, lines 138, 166, 196 and 225 of the code show that the align_cds function was used to align our samples by "Sample.ID".

      In Monocle3, "cds" refers to the cell_data_set object that holds the single-cell expression data, not to coding sequences. The align_cds function removes batch effects so that cells from different samples can be co-clustered: when an alignment_group is specified (here, "Sample.ID"), it aligns the groups in the reduced-dimension (PCA) space using the mutual nearest neighbor approach from the batchelor package, and it can additionally regress out technical covariates specified in residual_model_formula_str. Clustering and trajectory inference are then performed on the aligned space, which allows cells to be compared directly across groups. More details about align_cds can be found here https://rdrr.io/github/cole-trapnell-lab/monocle3/man/align_cds.html .

      We hope that this additional information alleviates the reviewer’s concerns.

      8) Since the experiments presented in this report were from young mice using a single diet intervention, the authors should comment on how age and other obesogenic diets may impact the results found here. Also, the authors should expand their discussion as to what upstream regulators (i.e., hormones or genetics) may be driving the sex differences in RELMα expression in response to HFD.

      We thank the reviewer for the suggestion. We have added several sentences to address this comment. However, because reviewers commented that some of the text needs to be trimmed, an extensive discussion of the reasons for sex differences, which are numerous, is outside the scope of this manuscript. For example, sex differences can arise from any or all of the following:

      1. Sex steroid hormones (estrogen and testosterone) are an obvious possibility for sex differences and this discussion has been included below and in the text.

      2. The sex differences we observe may stem from a variety of factors besides ovarian estrogen, including extraovarian estrogen, primarily estrogen produced in adipose tissues (32119876).

      3. Sex differences exist in fat deposition, which may or may not be estrogen dependent (25578600, 21834845).

      4. Sex differences have been reported in metabolic rate and oxidative phosphorylation, which may also be independent of estrogen (28650095, and reviewed in 26339468).

      5. Sex differences exist in the immune system, some of which are estrogen independent but dependent on sex chromosomes (32193609).

      6. Sex differences exist particularly in the myeloid lineage, which may also be estrogen independent (25869128).

      7. Sex differences have been reported in adipokine levels, including leptin and adiponectin, which influence immune cells in adipose tissues (33268480).

      The role of estrogen is also unclear, which precludes an extensive discussion. Numerous studies have demonstrated that estrogen protects against inflammation, so it is possible that estrogen drives some of the sex differences observed here. However, several studies found that estrogen can be pro-inflammatory (20554954, 15879140, 18523261). Previous publications by us (30254630, 33268480) and others (25869128) demonstrated intrinsic sex differences in the immune system that may depend on sex chromosome complement and/or Xist expression (34103397, 30671059).

      Studies are more consistent in showing that estrogen protects against weight gain: postmenopausal women with diminished estrogen, and ovariectomized animals, gain weight. The effects of ovariectomy on weight gain, and its additive effects with a high-fat diet, were reported in rhesus monkeys (for example PMID: 2663699; and PMID: 16421340) and in rodents (PMID: 7349433).

      The reviewer is correct that the effects of aging or estrogen on RELMa levels would be of significant interest and could be a future direction of our studies. The aging-mediated increase in inflammation (including of adipose tissue, recently reviewed in 36875140), which may be estrogen dependent, can exacerbate obesity-mediated inflammation. We have added this discussion.

      For these reasons, we limited our discussion of possible causes and stated the following in the Discussion: “Several studies demonstrated the protective role of estrogen in obesity-mediated inflammation and in weight gain, as discussed above. Whether estrogen protection occurs via estrogen regulation of RELMa levels is a focus of our future studies. Alternatively, intrinsic sex differences in the immune system that depend on sex chromosome complement and/or Xist expression (34103397, 30671059) have also been demonstrated (30254630, 33268480, 25869128), and RELMa may be regulated by these as well. Additionally, the aging-mediated increase in inflammation (including of adipose tissue, recently reviewed in 36875140) may also occur via changes in RELMa levels. Our studies used young but developmentally mature mice (4-6 weeks old when placed on diet, 18 weeks old at sacrifice), and future work on aged mice would be needed to investigate aging-mediated inflammation. Furthermore, there are sex differences in fat deposition, metabolic rates and oxidative phosphorylation (reviewed in 26339468), and in adipokine expression (Coss), which regulate cytokine and chemokine levels and therefore may regulate RELMa levels as well. These possibilities will be addressed in future studies.”

    1. Author Response

      Reviewer #1 (Public Review):

      Most work on antibiotic resistance focuses on particular resistance genes often located on plasmids, but rarely how these genes interact with others located on the chromosome of the host organism. Considering variation in the host genome and its interaction with resistance plasmids can help predict which hosts are more likely to become resistant to a given antibiotic and explain why the same plasmid may not confer the same level of resistance to different strains.

      The authors take a clever approach to finding such genetic interactions by designing an evolution experiment using E. coli carrying an MCR-1 plasmid containing resistance genes to colistin. They then select for increased resistance to colistin and sequence the genomes of the most resistant isolates. This allowed them to identify a particular gene lpxC that confers increased resistance to E. coli when combined with the MCR-1 plasmid (more than the sum of each mutation alone) and find that this is because of decreased membrane surface charge. They then investigate whether this mutation is relevant in wild E. coli isolates by analysing environmental samples from patients and other sources and find that indeed, this mutation is often found in carriers of the MCR-1 plasmid.

      The study is very well-designed and presented in a concise and logical manner. The use of evolution experiments to identify the mutations and then engineer them to quantify the epistatic effects and understand the mechanism behind them is very elegant. The real-world relevance is then supported by looking for these mutations in environmental samples. Despite this simplicity and clarity, in some places, the writing could be improved. I particularly found that the second half of the paper was not as easy to follow as the first part and could benefit from some clarifications. The figures could also contain a bit more information to help the reader.

      Thank you!
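      The phrase "more than the sum of each mutation alone" corresponds to positive epistasis measured on a multiplicative (log) scale. A minimal worked sketch, assuming resistance is measured as MIC and effects are compared in log2 units relative to the parental strain; the MIC values are invented for illustration and are not the study's data.

```python
from math import log2

def epistasis_log2(mic_wt, mic_a, mic_b, mic_ab):
    """Epistasis in log2(MIC) units: observed double-mutant effect minus the
    sum of single-mutant effects. Positive values mean the combination
    confers more resistance than expected from the single effects."""
    effect_a = log2(mic_a / mic_wt)
    effect_b = log2(mic_b / mic_wt)
    effect_ab = log2(mic_ab / mic_wt)
    return effect_ab - (effect_a + effect_b)

# Hypothetical MICs (ug/mL): parental strain, lpxC mutant alone,
# MCR-1 plasmid alone, and lpxC mutant carrying the MCR-1 plasmid.
eps = epistasis_log2(mic_wt=0.25, mic_a=0.25, mic_b=2.0, mic_ab=8.0)
print(eps)  # 2.0
```

      In this toy case the lpxC mutation alone has no effect and MCR-1 alone gives an 8-fold increase, yet together they give a 32-fold increase: 2 log2 units (4-fold) more than additivity predicts.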

      1.1 For example, the abstract starts by talking about standing genetic variation but it's not immediately clear what is meant by that. Standing genetic variation seems to suggest that the resistance gene itself is present in the initial population, rather than variation in other loci that might affect the selection of the resistance gene. This could be better formulated.

      We have revised the abstract to be clearer about the source of genetic variation.

      1.2 The figures could be improved by being more specific about the datasets: are mutations in Figure 2 in the WT or the MCR-1 positive lines? Are the SNPs in Fig. 4A in lpxC? Do all isolates in Fig. 4 have the MCR-1 plasmid?

      Thank you for the comment. We have edited the figure legend (line 128, page 5). Yes, Fig. 4A shows SNPs in lpxC, and all the isolates in Fig 4 have the MCR-1 plasmid. We have now clarified this in the figure legend (line 230, page 9).

      1.3 Finally, the arguments being made about diversity in the different phylogroups were not very clear. This could be made more explicit at first mention, rather than later in the discussion section.

      We have revised this section to clarify these points (lines 242-245, page 10).

      Reviewer #3 (Public Review):

      Jangir et al. used an 'evolutionary ramp' experiment to evolve E. coli strains under the selection pressure of increasing colistin concentrations, and the surviving fractions were collected for genomic analysis. They report that the mcr-1-carrying strain evolved higher colistin resistance much faster only in the presence of lpxC mutations in the genome. They show that the interaction between mcr-1 and lpxC is positively epistatic and that mutations in lpxC alone do not lead to colistin resistance. Taking a cue from their evolution experiments, they looked for variations in lpxC sequences in genomic datasets of clinical E. coli strains. They found many such variations in the genomes of clinical isolates. Importantly, they found those variations even in non-resistant strains, which might predispose those strains to gain untreatable levels of colistin resistance.

      Strengths:

      The study focuses on two key aspects of antibiotic resistance in clinical settings. First, is the antibiotic colistin itself which is part of the last line of defense. Second, is the importance of genomic variations in clinical isolates that have not been linked to any antibiotic resistance mechanisms. The data were presented in a logical sequence and maintained brevity. The link of lpxC to mcr-1 resistance is convincing.

      Thank you!

      Weaknesses:

      The basic premise of the paper is solid but the following should be addressed.

      3.1 In Figure 1, the authors applied the 'evolutionary ramp' method to isolate evolved strains with higher MIC to colistin; but the conditions for the evolution of WT and the strain carrying mcr-1 are different. Maintaining mcr-1 requires antibiotic selection, which WT cannot withstand. Hence, if I am not mistaken, WT was not grown in the presence of any antibiotic.

      The referee’s assertion that the selective pressures experienced by the WT and MCR+ populations were different is incorrect. We increased relative antibiotic dose (i.e., as a fraction of the MIC of the parental strains) at the same rate for both the WT and MCR+ populations. This is clearly explained in the text (lines 98-100, page 3), and the absolute colistin doses are shown in Figure 1. Please also see response 2.4 above.

      In our study, we used a naturally occurring MCR-1-carrying plasmid from the IncX4 family. This plasmid is actually very stable (at least in the short term) in the absence of colistin, despite the costs imposed by MCR-1. We speculate that this stability partly reflects the high conjugation rate of the plasmid and the presence of a toxin-antitoxin module.

      3.2 Not only that, maintaining a ~32 Kb plasmid itself can have different selective landscapes. The authors may replicate the experiment with their low-copy clone of mcr-1 which would make it easier for the authors to have an empty vector in WT as a proper control. Since now they know the expected mutations to be in lpxC, they might sequence a PCR amplicon of that region for validation of their hypothesis.

      This is an interesting idea for a future study. We agree with the referee that the presence of the MCR-1 plasmid may impose additional selective pressures that could potentially lead to bacteria-plasmid co-evolution. However, our data suggest that bacteria-plasmid interactions were not an important selective force over the course of our experiment: we detected no mutations in the plasmid, and almost all of the chromosomal mutations that we detected could be easily associated with selective pressures imposed by colistin.

      3.3 In Figure 2, what are the effects of these mutations in lpxC? The authors state that many mutations map on to the metal binding domain; but are those significant changes? LpxC is relatively well characterized and authors may want to comment on these mutations a little more.

      Yes, most of the evolved lines had mutations in the metal-binding domain, and this site is known to be very important for lpxC activity. For example, mutations at positions 79, 238, 242 and 246 lead to a hundred- to thousand-fold decrease in lpxC activity (PMID: 24117400, 24108127, and 11148046), and many of our mutations map close to these sites (lines 140-143, page 6, and Figure 2b).

      3.4 Also, lpxC mutations showed enrichment but lpxA did not. Is this suggestive of the type of Lipid A that is more preferred for the epistatic interactions? The authors may want to comment on that.

      It could indeed be the case that the epistatic interactions depend on the type of lipid A modification and its associated pleiotropic effects, because mutations in LPS biosynthesis genes can have diverse adverse effects by altering membrane properties. However, in-depth future work is required to understand how different types of changes in lipid A influence interactions with MCR. We chose not to explore this further in the paper, because lpxA was rarely mutated (2/17 clones) compared to lpxC (11/17 clones).

      3.5 In Figure 3, the lpxC mutant shows a reduction in fitness in a competition assay. What is the growth pattern of individual strains?

      The standard growth curve assay shows no significant difference in growth rate between the LpxC mutant and the wild-type strain (figure below). This is not surprising, because standard growth curves are not well suited to capturing small differences in growth or fitness. We therefore emphasize the results of the competition experiment, which is the gold-standard method for measuring fitness effects (Figure 3c).
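      For readers less familiar with competition assays, relative fitness is commonly computed as the ratio of the two competitors' realized growth (the log-fold change of each strain over the co-culture period). The sketch below assumes initial and final CFU counts for each strain; the numbers are made up, and the authors' exact calculation may differ.

```python
from math import log

def relative_fitness(mut_start, mut_end, ref_start, ref_end):
    """Relative fitness W = ln(mut_end/mut_start) / ln(ref_end/ref_start).
    W < 1 means the mutant grew less than the reference during the competition."""
    return log(mut_end / mut_start) / log(ref_end / ref_start)

# Hypothetical CFU/mL for each strain before and after co-culture.
w = relative_fitness(mut_start=1e6, mut_end=1e8, ref_start=1e6, ref_end=1e9)
print(round(w, 3))  # 0.667
```

      Because both strains share the same flask, small per-generation differences accumulate into a measurable ratio, which is why this design can detect fitness costs that separate growth curves miss.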

      3.6 There is a possibility that slow growth of lpxC mutant provides benefit under antibiotic stress.

      This is an interesting idea, but in this case, the slow growth of the lpxC mutant is clearly associated with a small decrease in colistin resistance (Figure 3A).

      3.7 Minor comment: the three individual replicates shown in Figure 3a are all identical within a sample and do not add to the figure where n=3. The authors can simply show SD or report correct values of replicates.

      We chose to show the raw data points, as this style of presentation is increasingly used by journals (many journals now ask authors to show all raw data points when n<6 or 10). Showing a standard error would not be informative here, as it was equal to 0.

      3.8 In Figure 4, as the authors themselves have stated, the difference in heterogeneity could simply be due to variation within phylogroups and subsequent compositional differences within the populations. The authors must check whether mutations were found in the same locations of lpxC as in their own evolved strains. Without this information, the heterogeneity data would be speculative. Adding the lpxC variants reported in figure 2 to the trees of figure 4 (right) will make it clear whether their conclusion is justified.

      This is an interesting point. We found no overlap between our experimentally evolved mutations and naturally occurring lpxC mutations, either at the level of nucleotides or codons. However, it is unclear whether we should expect to see an overlap, for two reasons: 1. The mutations present in natural isolates likely reflect a combination of beneficial, neutral, and weakly deleterious mutations, whereas the mutations found in our evolved isolates are all mutations that were beneficial under colistin selection. As such, it is probably not reasonable to expect a strong overlap between the two sets of mutations. 2. The lpxC mutations that we observed in our 11 lpxC-mutated isolates are highly diverse: we found no cases of parallel evolution at the nucleotide level, and only a single example of parallel evolution at the codon level. Given this, our data suggest that a very wide diversity of sites in lpxC can interact epistatically with MCR-1 to increase colistin resistance. Again, this high diversity of potential lpxC mutations would be expected to produce only a weak association between lab-evolved and clinical isolates.

      We have added these points in the text (lines 278-304, pages 11-12).

      3.9 The authors could perform a confirmatory experiment for the pre-existing part of their hypothesis. If they performed the evolutionary ramp experiment with a strain carrying an lpxC mutation, would they see faster evolution of high-MIC mutants?

      This is an interesting idea. Our results suggest that high-level colistin resistance would evolve more rapidly in the lpxC mutant than in a wild-type strain (assuming that both had an equivalent opportunity to acquire MCR-1 by horizontal gene transfer).

      4.0 The rationale of how the presence of lpxC mutations can cause a strain without any colistin resistance to acquire mcr-1 is not addressed. The authors may want to comment on that.

      MCR-1 is carried on conjugative plasmids, and the main plasmid families that carry MCR-1 (IncI2 and IncX4) have high conjugation rates. We have changed the text of the introduction to emphasize that MCR-1 is carried on conjugative plasmids, and we have linked MCR-1 acquisition to plasmid conjugation (lines 327-328, page 13).

    1. Author Response

      Reviewer #1 (Public Review):

      Here the authors aimed to gain insight into the role of Septin-7 in skeletal muscle biology using a novel and powerful mouse model of inducible, muscle-specific Septin-7 deletion. They combine this with CRISPR/Cas9- and shRNA-mediated manipulation of Septin-7 in C2C12 cells in vitro to explore its role in muscle progenitor morphology and proliferation. There are a variety of interesting observations, with clear phenotypes induced by the Septin-7 manipulation, including effects on body weight, muscle force production, mitochondrial morphology, and cell proliferation. However, each area is examined somewhat superficially, and certain conclusions require additional validation for robust support. Additionally, mechanistic insight into Septin-7's role is limited. Therefore, while the phenotypes are likely to intrigue both the muscle and septin communities, significantly advancing the field will require additional experimentation.

      Specifically, it is currently difficult to distinguish between developmental and adult roles of Septin-7. The authors induce tamoxifen-mediated deletion at 1 month of age and examine muscle structure/function only at 4 months. Without earlier time points, it is difficult to determine whether particular phenotypes are a direct consequence of Septin-7 deletion or secondary to muscle atrophy and/or a decline in body weight. Further, without inducing deletion at a later time point (i.e., after 2 months, when muscle is generally mature), it is difficult to assess whether Septin-7 plays a role in maintaining the structure and function of mature muscle, or whether its primary role is in muscle development.

      We have conducted a number of trials to knock down Septin-7 expression. These included Tamoxifen treatment of Cre- pregnant mothers, shorter treatments starting early after birth, and treatments of adult animals. While the former led to stillborn offspring, the latter resulted in only a minor (less than 20%) reduction of Septin-7 expression. These long trials led us, on the one hand, to concentrate on the protocol used throughout the manuscript (with which a significant reduction in protein expression, of up to 50%, could be achieved) and, on the other hand, to also focus on myogenic cells in culture. This choice was further supported by the finding that Septin-7 expression is highest in neonatal muscles and declines with age until adulthood (but remains essentially constant up to 18 months of age in the mice examined). As an identical Tamoxifen treatment of littermate Cre- mice did not result in any of the presented alterations (as demonstrated in the Supplementary material), we conclude that these alterations are the consequence of Septin-7 down-regulation. We nonetheless completely agree with the Reviewer that some observations are most likely indirect, i.e., due to the loss of muscle mass. These include, e.g., the altered shape of the vertebrae and the consequent "hunchback" phenotype. However, this observation further supports our claim that Septin-7 is essential for the proper development of normal musculature in these animals.

      Further, the conclusion that Septin-7 has an essential role in regeneration (seemingly based on its expression increasing after injury) is unsupported and requires further experimentation in which injury and regeneration are triggered in the absence of Septin-7 to establish a causative role.

      We agree with the Reviewer that establishing a clear causative role of Septin-7 in muscle regeneration would require a substantial amount of further experimentation on Septin-7 knock-down animals. We believe, however, that this work (a detailed description of the changes in transcription factors and key regulatory proteins, together with the morphological changes in Septin-7 KD animals following muscle injury) is beyond the scope of the present manuscript and should be presented as a separate study. In this manuscript, however, we provide the essential background to substantiate this claim. We show that fusion of myogenic cells is severely hindered when Septin-7 expression is suppressed, and that Septin-7 is upregulated following muscle injury to a significantly greater extent than would be expected if the increase were simply due to the production of new muscle fibers.

      Finally, there are intriguing observations in mitochondrial and myofiber organization and mitochondrial content; however further interrogation into additional relevant metrics of each, and at different time points of Septin-7 deletion, are needed to better understand these phenotypes and gain insight into Septin-7's role in their regulation.

      In response to the Reviewer's concern, we have conducted additional experiments to properly characterize mitochondrial morphology. Additional relevant metrics (Aspect Ratio and Form Factor) have been calculated, are now incorporated into the revised MS, and are presented in Figure 5.

      Reviewer #2 (Public Review):

      This is a comprehensive work describing, for the first time, the location and importance of the cytoskeletal protein Septin-7 in skeletal muscle. Using a Septin-7 conditional knockdown mouse model, the C2C12 cell line, and enzymatically isolated adult muscle fibers, the authors explore the normal location of this protein in muscle fibers, the morphological alterations under conditional knockdown, the developmental alterations, and the functional alterations in terms of force production. The global picture that emerges shows Septin-7 as a fundamental building block in muscle construction, development, and regeneration, all of which reinforces the essentially structural nature of this protein's role.

      We thank the Reviewer for the appreciative words. We indeed believe that Septin-7 plays an important role in the proper organization and development of skeletal muscle. Even a partial knock-down of the protein at early stages of life results in a severe loss of muscle mass accompanied by skeletal deformities. A complete knock-out of the protein results, at the myoblast level, in the inability of the cells to proliferate and form multinucleated cells, confirming the essential role of this structural protein.

      Reviewer #3 (Public Review):

      This is an original study exploring the role of Septin-7, a cytoskeletal protein, in skeletal muscle physiology. The authors produced a unique mouse model with skeletal muscle-specific conditional knockdown of Septin-7, which allowed them to examine the structural and functional changes of skeletal muscle in response to reduced Septin-7 protein levels in vivo and ex vivo at different developmental stages, without the influence of reduced Septin-7 expression in other tissues. The study of the cellular model, C2C12 myoblasts/myotubes with knockdown of Septin-7 expression, provided additional evidence of the importance of this cytoskeletal protein in regulating myoblast proliferation and differentiation. The majority of the data support the major claim of this manuscript. However, additional key experiments and data analysis are needed to provide a more mechanistic characterization of Septin-7 in muscle physiology.

      We would like to thank the Reviewer for the critical comments on our manuscript and for the valuable suggestions that help substantiate our claim that Septin-7 is an essential part of the cytoskeletal network in skeletal muscle and plays an important role in muscle differentiation as well as in myoblast proliferation and fusion.

      A number of additional experiments were carried out to address the comments and concerns of the Reviewer. Immunostaining of critical proteins (actin, myosin, and the L-type calcium channel) in Cre+ animals is now presented in Figure S4. The T-tubules of enzymatically isolated fibers from these Septin-7 knock-down mice were also stained using Di-8-ANEPPS, and the corresponding images are presented below. We describe how Tamoxifen treatments at different time points in the intra- and extra-uterine life of the animals resulted in deletion of the SEPTIN 7 gene, which ultimately led us to use the protocol described in this manuscript (the largest reduction with still-viable mice). A more detailed description of how the fusion index, a clear marker of myotube differentiation, was determined using desmin staining is now included, and additional experiments (immunostaining and western blot) with MYH, as suggested by the Reviewer, are also presented. We carried out a thorough analysis of mitochondrial morphology (in line with the requests of another Reviewer) and modified the corresponding figure in the revised MS accordingly.

      Major Concerns:

      1) The Septin-7 knockdown mouse model and the EM and IHC techniques are all established in the research group. It is surprising that the authors missed the opportunity to characterize the morphological changes in the T-tubule network, the triad structure, the distribution of Ca release units (i.e., IHC of DHPR and RyR), and their co-localization with other key cytoskeletal proteins (i.e., actin), etc., in muscle sections or isolated muscle fibers.

      We appreciate the Reviewer's valuable critical comments. Although we were not able to comply fully with all the requests, we addressed as many of the mentioned shortcomings as possible, both by correcting errors and by supporting our claims with further experiments. Please find our responses to each critical remark below.

      We conducted IHC staining on individual FDB fibers of C57Bl/6 mice, showing the distribution of skeletal muscle-specific α-actinin and RyR1 alongside Septin-7 (Figure 1E and F). As demonstrated by EM analysis in Figure 5E and F of the original MS (Figure 5F and G in the revised version), normal triad structures were present in both Cre- and Cre+ muscle samples. However, in Cre+ samples the sarcomeres were distorted at places where large mitochondria appeared.

      As suggested, T-tubule staining by Di-8-ANEPPS was carried out on isolated FDB fibers from Cre- and Cre+ animals, which revealed no considerable differences between the two groups.

      Images present the T-tubule system of single muscle fibers isolated from Cre- and Cre+ FDB muscle. Di-8-ANEPPS staining reveals no considerable difference between the two types of animals, suggesting that reduced Septin-7 expression does not alter the T-tubular system of skeletal muscle cells.

      To further investigate the key components of muscle contraction and EC coupling, we carried out immunostaining of isolated single fibers from FDB muscles of Cre+ and Cre- mice. Immunocytochemistry revealed no significant alteration of actin, myosin 4, or L-type calcium channel labeling between the two mouse strains (see Figure S4 in the revised version).

      2) The authors studied only one time point following the Tamoxifen treatment (4 months of age, after 3 months of treatment). Based on Fig 2D, a significant body weight reduction was achieved after one month of Tamoxifen treatment (at the age of 7 weeks), indicating potentially reduced muscle development at this age. Mice are considered fully mature at the age of 2 months. It would be more informative to analyze the muscle samples and the in vivo and in vitro muscle activity at this time point (7 or 8 weeks of age), which should directly answer whether knockdown of Septin-7 affects muscle development. Additionally, a time-dependent correlation of the level of Septin-7 knockdown with muscle function/morphology should better define the role of Septin-7 in muscle development and function.

      We agree with the Reviewer that Septin-7 presumably has a more pronounced effect in the early stages of muscle development, since we detected higher expression levels of the protein in muscle samples isolated from newborn and young animals than in those from adult animals. We conducted preliminary in vivo and in vitro force experiments on 2-month-old mice after 1 month of Tamoxifen treatment. Grip force had already decreased significantly in Cre+ mice, but the decrease in twitch and tetanic force of EDL and Sol did not reach significance. These experiments were followed by analysis of Septin-7 levels in the muscle samples, which showed, on average, less than 20% reduction in the samples of Cre+ mice. This suggested that a more robust suppression of Septin-7 was needed to achieve a significant reduction in in vitro force; we therefore decided to extend the Tamoxifen treatment to 3 months.

      3) Although the expression level of Septin-7 decreases during muscle development (Fig 1C), its expression is still evident at the age of 4 months (Fig 1C and Fig S1F), indicating a potential role of Septin-7 in maintaining normal muscle function. It is important to examine whether Tamoxifen treatment started after muscle maturation, at 2 months of age, would affect muscle structure and function. In particular, this type of KD mouse will be critical for determining whether the KD affects the rate of regeneration following muscle injury. The outcome would further test or support the authors' claim of an essential role of Septin-7 in muscle regeneration.

      We agree with the Reviewer's opinion that Septin-7 presumably plays an essential role not only during early development of skeletal muscle but also in the mature tissue. In our preliminary studies, Septin-7 protein expression was determined in skeletal muscle samples from mice at different developmental stages. As presented in Figure 1C, we observed a decrease in Septin-7 protein expression from the newborn to the adult stage. The expression profile of Septin-7 was also investigated in samples from 2-, 4-, 6-, 9-, and 18-month-old mice, and a significant decrease was observed in samples isolated from mice at 4, 6, 9, and 18 months of age (58±8, 48±9, 66±16, and 54±9% relative to the 2-month-old muscles, respectively); however, there were no considerable changes among samples after 4 months of age.

      In order to generate skeletal muscle-specific, conditional Septin-7 knock-down animals, we applied Tamoxifen treatment at different developmental stages in our preliminary studies (see the table and figures below). When Cre- pregnant females were fed Tamoxifen in the third trimester of pregnancy, it caused intrauterine lethality independent of genotype. In accordance with animal ethics requirements, we did not continue this experimental protocol. In the next stage of our initial experiments, 3-month-old mice were treated either with intraperitoneal injections for 5 consecutive days or with a Tamoxifen diet for 4 weeks. Here, only a moderate deletion of exon 4 of the SEPTIN 7 gene was detected in Cre+ animals (data obtained from these mice are shown below).

      These findings and the observation of ontogenesis-dependent expression of Septin-7 indicated its significance at early stages of development and suggested that we should try to modify gene expression at an earlier age. Six weeks of a Tamoxifen-supplemented diet generated well-detectable exon deletion in younger (1-month-old) mice. Based on these observations, we decided to start the Tamoxifen-supplemented diet in younger (4-week-old) animals immediately after separation from the mother, and we continued the treatment for a longer period (3 months) to ensure that exon deletion would be prominent in all Cre+ animals.

      Genetic modification of the SEPTIN 7 gene following Tamoxifen treatment in the mice mentioned above (RT-PCR).

      The figure shows the floxed sites of the SEPTIN 7 gene (white arrow) and the deletion of exon 4 (red arrows) in DNA samples isolated from mice treated with Tamoxifen starting at different ages and using different methods and periods of Tamoxifen application. Exon 4 deletion was less than 20%; therefore, these trials were not continued. Numbers above each lane correspond to the animal IDs presented in the table above. Q: m. quadriceps, B: m. biceps femoris, P: m. pectoralis.

      Knock-down of Septin-7 in adult animals (in which its expression is already low; see above) did not result in an appreciable further reduction. This led us to conclude that the role of Septin-7 is most pronounced in muscle development. Within this framework, a possible function of Septin-7 in muscle regeneration following injury could be envisioned at the adult stage. This is demonstrated in Figure 6, where we show that Septin-7 is upregulated following a mild injury. However, we believe that a detailed examination of the role of Septin-7 in regeneration is beyond the scope of the current manuscript and should be the basis of further studies.

      4) Regarding the impact of Septin-7 on differentiation, it could be problematic if images with the resolution shown in Figure S4A-C were used for fusion index calculation. If those are just zoomed-in representative images and the authors used other, lower-resolution global-view images for quantification, those images need to be shown. The authors may also need to explain why they stained Desmin instead of MYH to quantify the fusion index of myotubes (page 27). Desmin also marks mesenchymal cells.

      We apologize that the method used for fusion index calculation was not clear enough. Images in Figure S4A-C present the Septin-7 and actin cytoskeletal structure in proliferating myoblasts, before the induction of differentiation. The fusion index was determined in cultures in which myotube differentiation was induced by reduced serum content (as described in Methods). We used desmin staining because this protein is expressed only in myotubes with 2 or more nuclei, in which fusion of myoblasts has already started (see representative images below). Representative desmin-labeling images from control, scrambled, and KD cultures at day 5 of differentiation are now included in Figure S5G.

      The figure presents two examples (the bottom row has now been added to Figure S5 as panel G) of the desmin-specific immunostaining used to calculate the fusion index in the different C2C12 cultures. Desmin-specific signal (green) is present following the fusion of mononucleated myoblasts into myotubes, while undifferentiated myoblasts show no immunolabeling for desmin. Nuclei are stained with DAPI (blue).

      If Septin-7 truly affects differentiation, a decrease in MYH2 expression should be readily detectable by IHC or WB.

      We are grateful for the Reviewer's suggestion. We have conducted immunocytochemistry and WB experiments on proliferating myoblasts and on myotubes at day 5 of differentiation. As the figure below demonstrates, myosin heavy chain-specific immunolabeling could be detected only in differentiated samples, while myoblasts showed no positive signal. However, there were significantly fewer MYH2-positive myotubes in Septin-7 KD cultures than in control and scrambled samples. In addition, we detected a decreased WB signal for MYH2 in Septin-7 KD protein samples compared with their control counterparts.

      The figure presents MYH2-specific immunostaining in the different C2C12 cultures. Specific signals of myosin heavy chain 2 (green) are present during myotube formation in differentiating cultures; however, fewer MYH2-positive myotubes are present in the Septin-7 KD cultures as a result of the reduced capacity of the cells to fuse, and in these cultures mostly only DAPI-stained nuclei are visible. Proliferating myoblasts did not show specific immunolabeling for MYH2, as shown by the confocal image and the corresponding part of the WB membranes. Using WB, we could also detect decreased MYH2-specific labeling in Septin-7 KD samples compared with control ones.

      Additionally, Septin-7 may affect the migration or fusion of myoblasts rather than differentiation. The altered cell morphology and filopodia/lamellipodia formation (Figure 3C) in Septin-7 KD cells before differentiation also imply a potential role of Septin-7 in migration. This possibility should at least be discussed.

      We appreciate the Reviewer's comment and suggestion. A few publications have shown that altered septin (in some cases Septin-7) expression modifies the migration of different eukaryotic cell types, such as microvascular endothelial cells (PMID: 24451259), human epithelial cells (PMID: 31905721), neural crest cells (PMID: 2881782), and human breast cancer or lung cancer cells (PMID: 27557506, 31558699, and 32516969). Li et al. (PMID: 32382971) found that miR-127-3p regulates myoblast proliferation by targeting Septin-7. In the present manuscript we showed that Septin-7 modification alters myoblast fusion (Figure 3J), a process that accompanies differentiation. In addition, the effect of Septin-7 gene silencing on cell migration has been studied in detail and was presented to the Biophysical Society. These results are intended to be submitted as a separate manuscript.

      5) The image shown in Figure 5F does not support the pooled data shown in Figure 5C. The size of mitochondria is remarkably larger in Cre+ muscle (Fig 5E and 5F). The morphology of mitochondria in Cre+ muscle is apparently normal (Fig 5F), while the mitochondrial DNA content is drastically reduced (Figure 5H), which is an important discovery that deserves further confirmation by WB and/or qPCR for critical mitochondrial proteins (i.e., MTCOX, COXV, etc.).

      We thank the Reviewer for pointing out that the interpretation of the images in Figure 5 was not clear enough. Based on this, and on the explicit request of the other Reviewer, a detailed evaluation of mitochondrial morphology was carried out, and the panels of Figure 5 were redrawn and reorganized. The revised Figure 5 now presents the average Perimeter, the average Aspect Ratio, and the average Form Factor (panels C and H, for cross- and horizontal sections, respectively), the relative distributions of the areas (panels D and I, for cross- and horizontal sections, respectively), and the number of mitochondria normalized to fiber area (panel E, cross-sections). The mitochondrial DNA content is presented in panel J. As evidenced by these figures (and by the representative EM micrographs), larger mitochondria, sometimes in large clusters, are present in the muscles of Cre+ animals.

      Furthermore, gene expression of four essential mitochondrial proteins (cytochrome oxidase 1 (COX1), cytochrome oxidase 2 (COX2), succinate dehydrogenase (SDH), and ATP synthase) was determined by qPCR in RNA samples from different skeletal muscles of Cre- and Cre+ animals. As the figure below demonstrates, there was a tendency toward decreased expression of these genes in Cre+ muscle samples; however, no significant difference between the Cre- and Cre+ data could be detected.

      The figure shows the normalized mRNA expression of ATP synthase, SDH, COX1, and COX2 in Cre- (green) and Cre+ (red) samples isolated from m. quadriceps and m. pectoralis. Each gene's expression was determined in 3 individual animals, with technical duplicates in the qPCR analysis. The 36B4 gene, encoding acidic ribosomal phosphoprotein P0, was used as the normalizing gene.

      6) Figure 2 H & I: It is unclear whether the muscle force was normalized to the individual muscle weight.

      We apologize for the incomplete presentation and explanation of the muscle force values. Figure 2F-I presents absolute force values without normalization to the cross-sectional area. In response to the Reviewer's comment, the averages of the normalized values are given in Table S3 of the revised manuscript.

      7) The IHC results in Figure 6B are confusing. There are no centrally located nuclei in the Pax7-alone image of Figure 6B, but they are abundant in the Pax7 + H&E image. The brown color of DAB and the purple color of hematoxylin are hard to distinguish.

      Images showing Pax7 labeling (a transcription factor expressed in activated satellite cells) alone cannot show centrally located nuclei, as the nuclei become visible only when H&E staining is applied. As the Reviewer mentioned, the brown color of DAB and the purple color of hematoxylin are sometimes difficult to distinguish; therefore, we first presented PAX7 expression visualized by DAB staining (localized near the sarcolemma). In the next step, we performed double staining for PAX7 and H&E to show both the cytoplasm and the nuclei.

    1. Author Response

      Reviewer #1 (Public Review):

      In this manuscript, Goering et al. investigate subcellular RNA localization across different cell types focusing on epithelial cells (mouse C2bbe1 and human HCA-7 enterocyte monolayers, canine MDCK epithelial cells) as well as neuronal cultures (mouse CAD cells). They use their recently established Halo-seq method to investigate transcriptome-wide RNA localization biases in C2bbe1 enterocyte monolayers and find that 5'TOP-motif containing mRNAs, which encode ribosomal proteins (RPs), are enriched on the basal side of these cells. These results are supported by smFISH against endogenous RP-encoding mRNAs (RPL7 and RPS28) as well as Firefly luciferase reporter transcripts with and without mutated 5'TOP sequences. Furthermore, they find that 5'TOP-motifs are not only driving localization to the basal side of epithelial cells but also to neuronal processes. To investigate the molecular mechanism behind the observed RNA localization biases, they reduce expression of several Larp proteins and find that RNA localization is consistently Larp1-dependent. Additionally, the localization depends on the placement of the TOP sequence in the 5'UTR and not the 3'UTR. To confirm that similar RNA localization biases can be conserved across cell types for other classes of transcripts, they perform similar experiments with a GA-rich element containing Net1 3'UTR transcript, which has previously been shown to exhibit a strong localization bias in several cell types. In order to determine if motor proteins contribute to these RNA distributions, they use motor protein inhibitors to confirm that the localization of individual members of both classes of transcripts, 5'TOP and GA-rich, is kinesin-dependent and that RNA localization to specific subcellular regions is likely to coincide with RNA localization to microtubule plus ends that concentrate in the basal side of epithelial cells as well as in neuronal processes.

      In summary, Goering et al. present an interesting study that contributes to our understanding of RNA localization. While RNA localization has predominantly been studied in a single cell type or experimental system, this work looks for commonalities to explain general principles. I believe that this is an important advance, but there are several points that should be addressed.

      Comments:

      1) The Mili lab has previously characterized the localization of ribosomal protein and NET1 mRNAs to protrusions (Wang et al, 2017, Moissoglu et al 2019, Crisafis et al., 2020) and the role of kinesins in this localization (Pichon et al, 2021). These papers should be cited and their work discussed. I do not believe this reduces the novelty of this study; rather, it supports the generality of the RNA localization patterns across additional cellular locations in other cell types.

      This was an unintentional oversight on our part, and we apologize. We have added citations for the mentioned publications and discussed our work in the context of theirs.

      2) The 5'TOP motif begins with an invariant C nucleotide, and mutation of this first nucleotide next to the cap has been shown to reduce translational regulation during mTOR inhibition (Avni et al, 1994 and Biberman et al 1997) as well as Larp1 binding (Lahr et al, 2017). Consequently, it is not clear to me whether RPS28 initiates transcription with an A, as indicated in Figure 3B. There also seem to be some differences among published CAGE datasets, but this point needs to be clarified. Additionally, it is not clear to me how the 5'TOP Firefly luciferase reporters were generated and whether the transcription start sites and exact 5' ends of these constructs were determined. This is again essential for determining whether it is a pyrimidine sequence in the 5'UTR or the 5'TOP motif that is important for localization, and whether Larp1 directly regulates localization by binding to the 5'TOP motif or the effect they observe is indirect (e.g., is Larp1 also basally localized?). It should also be noted that Larp1 has been suggested to bind pyrimidine-rich sequences in the 5'UTR that are not next to the cap, but the details of this interaction are less clear (Al-Ashtal et al, 2021).

      We did not fully appreciate the subtleties related to TOP motif location when we submitted this manuscript, so we thank the reviewer for pointing them out.

      We also analyzed public CAGE datasets (Andersson et al, 2014 Nat Comm) and found that the start sites for both RPL7 and RPS28 were quite variable within a window of several nucleotides (as is the case for the vast majority of genes), suggesting that a substantial fraction of both do not begin with pyrimidines (Reviewer Figure 1). Yet, by smFISH, endogenous RPL7 and RPS28 are clearly basally/neurite localized (see new figure 3C).

      Reviewer Figure 1. Analysis of transcription start sites for RPL7 (A) and RPS28 (B) using CAGE data (Andersson et al, 2014 Nat Comm). Both genes show a window of transcription start sites upstream of current gene models (blue bars at bottom).

      A more detailed analysis of our reporter transcripts containing a pyrimidine-rich element (PRRE) revealed that in these reporters, the pyrimidine-rich element was approximately 90 nucleotides into the body of the 5' UTR. Yet these reporters are also basally/neurite localized. The organization of the PRRE-containing reporters is now shown more clearly in an updated figure 3D.

      From these results, it would seem that the pyrimidine-rich element need not be next to the 5’ cap in order to regulate RNA localization. To generalize this result, we first used previously identified 5’ UTR pyrimidine-rich elements that had been found to regulate translation in an mTOR-dependent manner (Hsieh et al 2012). We found that, as a class, RNAs containing these motifs were similarly basally/neurite localized as RP mRNAs. These results are presented in figures 3A and 3I.

      We then asked if the position of the pyrimidine-rich element within the 5’ UTR of these RNAs was related to their localization. We found no relationship between element position and transcript localization as elements within the bodies of 5’ UTRs were seemingly just as able to promote basal/neurite localization as elements immediately next to the 5’ cap. These results are presented in figures 3B and 3J.

      To further confirm that pyrimidine-rich elements need not be immediately next to the 5’ cap, we redesigned our RPL7-derived reporter transcripts such that the pyrimidine-rich motif was immediately adjacent to the 5’ cap. This was possible because the reporter uses a CMV promoter that reliably starts transcription at a known nucleotide. We then compared the localization of this reporter (called “RPL7 True TOP”) to our previous reporter in which the pyrimidine-rich element was ~90 nt into the 5’ UTR (called “RPL7 PRRE”) (Reviewer Figure 2). As with the PRRE reporter, the True TOP reporter drove RNA localization in both epithelial and neuronal cells while purine-containing mutant versions of the True TOP reporter did not (Reviewer Figure 2A-D). In the epithelial cells, the True TOP was modestly but significantly better at driving basal RNA localization than the PRRE (Reviewer Figure 2E), while in neuronal cells the True TOPs were modestly but not significantly better. Again, this suggests that pyrimidine-rich motifs need not be immediately cap-adjacent in order to regulate RNA localization.

      Reviewer Figure 2. Experimental confirmation that pyrimidine-rich motif location within 5’ UTRs is not critical for RNA localization. (A) RPL7 True TOP smFISH in epithelial cells. (B) RPL7 True TOP smFISH in neuronal cells. (C) Quantification of epithelial cell smFISH in A. (D) Quantification of neuronal cell smFISH in B. (E) Comparison of the location in epithelial cells of endogenous RPL7 transcripts, RPL7 PRRE reporter transcripts, and RPL7 True TOP reporter transcripts. (F) Comparison of the neurite-enrichment of RPL7 PRRE reporters and RPL7 True TOP reporters. In C-F, the number of cells included in each analysis is shown.

      In response to the point about whether the localization results are direct effects of LARP1, we did not assay the binding of LARP1 to our PRRE-containing reporters, so we cannot say for sure. However, given that PRRE-dependent localization required LARP1 and there is much evidence about LARP1 binding pyrimidine-rich elements (including those that are not cap-proximal as the reviewer notes), we believe this to be the most likely explanation.

      It should also be noted here that while pyrimidine-rich motif position within the 5’ UTR may not matter, its location within the transcript does. PRREs located within 3’ UTRs were unable to direct RNA localization (Figure 5).

      3) In figure 1A, they indicate that mRNA stability can contribute to RNA localization, but this point is never discussed. This may be important to their work since Larp1 has also been found to impact mRNA half-lives (Aoki et al, 2013 and Mattijssen et al 2020, Al-Ashtal et al 2021). Is it possible the effect they see when Larp1 is depleted comes from decreased stability?

      We found that PRRE-containing reporter transcripts were generally less abundant than their mutant counterparts in C2bbe1, HCA7, and MDCK cells (figure 3 – figure supplements 5, 6, and 8) although the effect was not consistent in mouse neuronal cells (figure 3 – figure supplement 13).

      However, we don’t think it is likely that the changes in localization are due to stability changes. This abundance effect did not seem to be LARP1-dependent as both PRRE-containing and PRRE-mutant reporters were generally more expressed in LARP1-rescue epithelial cells than in LARP1 KO cells (figure 4 – figure supplement 9).

      It should be noted here that we never directly measure transcript stability, only steady-state abundance. We therefore cannot rule out that LARP1 regulates the stability of our PRRE reporters. However, given that their localization was dependent on kinesin activity (figures 7F, 7G), we believe the most likely explanation for the localization effects is active transport.

      4) Also Moor et al, 2017 saw that feeding cycles changed the localization of 5'TOP mRNAs. Similarly, does mTOR inhibition or activation or simply active translation alter the localization patterns they observe? Further evidence for dynamic regulation of RNA localization would strengthen this paper.

      We are very interested in this and have begun exploring it. We have data suggesting that PRREs also mediate the feeding cycle-dependent relocalization of RP mRNAs. As the reviewer says, we think this leads to a very attractive model involving mTOR, and we are currently working to test this model. However, we don’t have the room to include those results in this manuscript and would instead prefer to include them in a later manuscript that focuses on nutrient-induced dynamic relocalization.

      5) For smFISH quantification, is every mRNA treated as an independent measurement so that the statistics are calculated on hundreds of mRNAs? Large sample sizes can give significant p-values despite very small differences, as observed for Firefly vs. OSBPL3 localization. Since the biological interpretation of effect size is not always clear, I would suggest plotting RNA position per cell or only treating biological replicates as independent measurements to determine statistical significance. This should also be done for other smFISH comparisons.

      This is a good suggestion, and we agree that using individual puncta as independent observations will artificially inflate the statistical power in the experiment. To remedy this in the epithelial cell images, we first reanalyzed the smFISH images using each of the following as a unique observation: the mean location of all smFISH puncta in one cell, the mean location of all puncta in a field of view, and the mean location of all puncta in one coverslip. With each metric, the results we observed were very similar (Reviewer Figure 3) while the statistical power of course decreased. We therefore chose to go with the reviewer-suggested metric of mean transcript position per cell.

      Reviewer Figure 3. C2bbe1 monolayer smFISH spot position analysis. RNA localization across the apicobasal axis is measured by smFISH spot position in the Z axis. This can be plotted for each spot, but then thousands of spots inflate the statistical power. Spot position can be averaged per cell as outlined manually within the FISH-quant software. This reduces sample size and allows for more accurate statistical analysis. When spot position is averaged per field of view, sample size further decreases and statistics are less powered, but the localization trends are still robust. Finally, we can average spot position per coverslip, which represents biological replicates. We lose almost all statistical power as sample size is limited to 3 coverslips. Despite this, the localization trends are still recognizable.
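      To make the aggregation levels in Reviewer Figure 3 concrete, here is a minimal Python sketch (not the FISH-quant workflow actually used) that collapses per-spot z-positions into one mean per cell, per field of view, or per coverslip. The input format and field names (`coverslip`, `fov`, `cell`, `z`) are hypothetical; the point is only that each unit, rather than each spot, becomes one observation for the downstream statistical test.

```python
from collections import defaultdict
from statistics import mean

# Grouping keys for each aggregation level; nested ids keep units distinct
# (e.g. cell 1 on coverslip 1 is not the same unit as cell 1 on coverslip 2).
LEVEL_KEYS = {
    "coverslip": ("coverslip",),
    "fov": ("coverslip", "fov"),
    "cell": ("coverslip", "fov", "cell"),
}

def aggregate_positions(spots, level):
    """Average spot z-positions within each unit at the chosen level.

    spots: iterable of dicts with hypothetical keys 'coverslip', 'fov',
    'cell' and 'z'. Returns {unit_id: mean_z}, one observation per unit.
    Per-fov and per-coverslip means pool all spots in that unit.
    """
    keys = LEVEL_KEYS[level]
    groups = defaultdict(list)
    for spot in spots:
        groups[tuple(spot[k] for k in keys)].append(spot["z"])
    return {unit: mean(zs) for unit, zs in groups.items()}
```

      With this layout, thousands of spots shrink to a few dozen per-cell values (or three per-coverslip values), which is what lowers the apparent statistical power as the unit of observation becomes coarser.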

      When we use this metric, all results remain the same with the exception of the smFISH validation of endogenous OSBPL3 localization. That result loses its statistical significance and has now been omitted from the manuscript. All epithelial smFISH panels have been updated to use this new metric, and the number of cells associated with each observation is indicated for each sample.

      For the neuronal images, these were already quantified at the per-cell level as we compare soma and neurite transcript counts from the same cell. In lieu of more imaging of these samples, we chose to perform subcellular fractionation into soma and neurite samples followed by RT-qPCR as an orthogonal technique (figure 3K, figure 3 supplement 14). This technique profiles the population average of approximately 3 million cells.

      6) F: How was the segmentation of soma vs. neurites performed? It would be good to have a larger image as a supplemental figure so that it is clear whether proximal or distal neurite segments are being compared.

      All neurite vs. soma segmentations were done manually. An example of this segmentation is included as Reviewer Figure 4. This means that often only proximal neurite segments are included in the analysis, as it is often difficult to find an entire soma and an entire neurite in one field of view. However, in our experience, inclusion of more distal neurite segments would likely only strengthen the smFISH results, as we often observe many molecules of localized transcripts in the distal tips of these neurites.

      Reviewer Figure 4. Manual segmentation of differentiated CAD soma and neurite in FISH-quant software. Neurites that do not overlap adjacent neurites are selected for imaging. Often neurites extend beyond the field of view, limiting this assay to RNA localization in proximal neurites.

      Also, it should be noted that the neuronal smFISH results are now supplemented by experiments involving subcellular fractionation and RT-qPCR (figure 3 supplement 14). These subcellular fractionation experiments collect the whole neurite, both the proximal and distal portions.

      Text has been added to the methods under the header “smFISH computational analysis” to clarify how the segmentation was done.

    1. Author Response

      Reviewer #3 (Public Review):

      Main results:

      1) TCR convergence is different from publicity: The authors look at CDR3 sequence features of convergent TCRs in the large Emerson CMV cohort. Amino acid usage does not perfectly correlate with codon degeneracy, for example, arginine (which has 6 codons) is less common in convergent TCRs, whereas leucine and serine are elevated. It's argued that there's more to convergence than just recombination biases, which makes sense. (I wonder if the trends for charged amino acids could be explained by the enrichment of convergent TCRs in CD8 T cells, which tend to have more acidic CDR3 loops). There's also a claim that the overlap between convergent and public TCRs is lower in tumors with a high mutational burden (TMB), but this part is sketchy: the definition of public TCRs is murky and hard to interpret, and the correlation between TMB and convergence-publicity overlap is modest (two cohorts with low TMB have higher overlap, and the other three have lower, but there is no association over those three, if anything the trend is in the other direction). It's also not clear why the overlap between COVID19 cohort convergent TCRs and public TCRs defined by the pre-2019 Emerson cohort should be high. A confounder here is the potential association between convergence and clonal expansion since expanded clonotypes can spawn apparently convergent TCRs due to sequencing errors. The paper "TCR Convergence in Individuals Treated With Immune Checkpoint Inhibition for Cancer" (Ref#5 here) gives evidence that sequencing errors may be inflating convergence in this specific dataset.

      We really appreciate the reviewer’s feedback. We respond to each of the reviewer’s points below:

      (1) Amino acid preference of convergent TCRs might be caused by CD8+ T cell enrichment. To test this hypothesis, we performed the same analysis using only CD8+ T cells (using the Cader 2019 lymphoma cohort). The results are shown below. We do not observe significant changes after excluding CD4+ T cells, indicating that this enrichment might be caused by factors other than CD4/CD8 differences.

      (2) Definition of public TCRs. We have changed the definition of public TCRs. Instead of mixing the Emerson cohort into each group and using the mixed cohort to define the public TCRs, we used only the 666 samples of the Emerson cohort to define a single set of public TCRs and applied it to each cohort. Both the dataset and the approach used in this manuscript are consistent with a previous study on the same topic (Madi et al., 2014, elife).

      (3) Convergence-publicity overlap: We agree with the reviewer that some high TMB tumors did not show further decrease of convergence-publicity overlap. One potential explanation is that the correlation between the two is not linear. By adding additional cohorts in this revision (healthy and recovered COVID-19 patients), we confirmed the previously observed overall trend between TMB and the overlap, which supported our conclusions (see figure below). On the other hand, we believe that the high overlap of convergent TCRs among healthy cohorts might result from exposure to common antigens. In the cancer patients, while still exposed, private antigens derived from tumor cells are expected to compete for resources, thus reducing the proportion of these public TCRs in the blood repertoire. The above discussion has been added to the revised manuscript:

      “Healthy individuals are expected to be exposed to common pathogens, which might induce public T cell responses. On the other hand, cancer patients have more neoantigens due to accumulated mutations, which drives their antigen-specific T cells to recognize these 'private' antigens. This reduces the proportion of public TCRs in antigen-specific TCRs. Furthermore, a higher tumor mutation burden (TMB) would indicate a higher abundance of neoantigens, resulting in a lower ratio of public TCRs.”
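      The revised public-TCR definition and the convergence-publicity overlap can be sketched as follows. This is an illustrative Python sketch only: the minimum number of reference repertoires a CDR3 must appear in (`min_samples`) is a hypothetical parameter, since the threshold used in the manuscript is not stated in this reply, and the overlap is assumed here to be the fraction of a cohort's convergent CDR3s that fall in the public set.

```python
from collections import Counter

def public_set(reference_samples, min_samples):
    """Return CDR3s observed in at least `min_samples` reference repertoires.

    reference_samples: iterable of repertoires (e.g. the Emerson samples),
    each an iterable of CDR3 strings. Each CDR3 is counted once per sample.
    `min_samples` is a hypothetical sharing threshold.
    """
    counts = Counter()
    for sample in reference_samples:
        counts.update(set(sample))
    return {cdr3 for cdr3, n in counts.items() if n >= min_samples}

def overlap_fraction(convergent, public):
    """Fraction of convergent CDR3s that are also in the public set."""
    convergent = set(convergent)
    if not convergent:
        return float("nan")
    return len(convergent & public) / len(convergent)
```

      Because the public set is computed once from the reference cohort and then reused, every cohort is compared against the same fixed set, which is the property the revised definition is meant to guarantee.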

      2) Convergent TCRs are more likely to be antigen-specific: This is nicely shown on two datasets: the large dextramer dataset from 10x genomics, and the COVID19 datasets from Adaptive biotech. But given previous work on TCR convergence, for example, the Pogorelyy ALICE paper, and many others, this is also not super-surprising.

      We thank the reviewer for bringing up this related work. In the Pogorelyy ALICE paper, the authors defined TCR neighbors based on one nucleotide difference of a given CDR3, which included both synonymous and non-synonymous changes. In other words, ALICE combines both convergence and mismatched (with hamming distance 1) sequences as neighbors. Although highly relevant, our approach is different by focusing only on the convergence, as mismatch has been extensively investigated by previous studies. We have now added this paper as Ref 27, and discussed the difference between ALICE and our method in the revised manuscript.
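      The distinction drawn above between convergence and one-mismatch neighbors can be illustrated with a small Python sketch. It assumes the usual definition of convergence (one amino acid CDR3 encoded by at least two distinct nucleotide CDR3s in the same repertoire) and implements neighbors exactly as described in this reply (one nucleotide difference); it is neither the ALICE implementation nor the authors' own code.

```python
from collections import defaultdict

def convergent_aa(clones):
    """Return amino acid CDR3s encoded by >= 2 distinct nucleotide CDR3s.

    clones: iterable of (nt_cdr3, aa_cdr3) pairs from one repertoire.
    """
    nts_per_aa = defaultdict(set)
    for nt, aa in clones:
        nts_per_aa[aa].add(nt)
    return {aa for aa, nts in nts_per_aa.items() if len(nts) >= 2}

def hamming1_nt_neighbors(nt, pool):
    """Nucleotide CDR3s in `pool` differing from `nt` at exactly one position.

    This captures both synonymous changes (same amino acid sequence, i.e.
    convergent) and non-synonymous changes (a one-residue mismatch).
    """
    return {
        p for p in pool
        if len(p) == len(nt) and sum(a != b for a, b in zip(nt, p)) == 1
    }
```

      For example, TGTGCC and TGCGCC both encode CA (a synonymous change, so CA is convergent), whereas TGTACC encodes CT; all three differ pairwise-from-TGTGCC by one nucleotide, so a one-nucleotide neighbor search groups them together while convergence picks out only the first pair.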

      3) Convergent T cells exhibit a CD8+ cytotoxic gene signature: This is based on a nice analysis of mouse and human single-cell datasets. One striking finding is that convergent TCRs are WAY more common in CD8+ T cells than in CD4+ T cells. It would be interesting to know how much of this could be explained by greater clonal expansion of CD8+ T cells, together with sequencing errors. A subtle point here is that some of the P values are probably inflated by the presence of expanded clonotypes: a group of cells belonging to the same expanded clonotype will tend to have similar gene expression (and therefore similar cluster membership), and will necessarily all be either convergent or not convergent collectively since they share the same TCR. So it's probably not quite right to treat them as independent for the purposes of assessing associations between gene expression clusters and convergence (or any other TCR-defined feature). You can see evidence for clonal expansion in Figure 3C, where TRAV genes are among the most enriched, suggesting that Cluster 04 may contain expanded clones.

      (1) We agree with the reviewer that a possible explanation of the CD8/CD4 difference is the greater clonal expansion of CD8+ T cells. We tested this hypothesis by counting T cell clones instead of cells, which removes the effect of CD8+ T cell expansion. We first investigated the bulk TCR repertoire sequencing samples, as shown in Figure 3 - figure supplement 2C-2D (see figure below). We observed higher convergence levels for CD8+ T cell clones than for CD4+ T cell clones. An additional description of this topic was added to the last paragraph of the Results section “Convergent T cells exhibit a CD8+ cytotoxic gene signature” as follows:

      “These results may be explained by larger clonal expansions of CD8+ T cells than of CD4+ T cells. Therefore, we calculated the number of convergent clones within CD8+ T cells and CD4+ T cells from the above datasets to exclude the effects of cell expansion. In the scRNA-seq mouse data, while only 1.54% of the CD4+ clones were convergent, 3.76% of the CD8+ clones showed convergence. Likewise, in the human scRNA-seq data, 0.17% of CD4+ T cell clones and 1.03% of CD8+ T cell clones were convergent. Similar results were observed in the bulk TCR-seq lymphoma data, where the gap between the convergence levels of CD4+ and CD8+ T cells narrowed but remained significant (Figure 3 - figure supplement 2C-2D). In conclusion, these results suggest that CD8+ T cells show higher levels of convergence than CD4+ T cells, which supports our hypothesis that convergent T cells are more likely to be antigen-experienced. This observation has been tested using multiple datasets with diverse sequencing platforms and sequencing depths to minimize the impact of batch or other technical artifacts.”
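      The difference between counting cells and counting clones can be shown with a minimal Python sketch. The input tuples `(cell_type, clone_id, is_convergent)` are a hypothetical format; because convergence is a property of the TCR that all cells of a clone share, an expanded clone contributes many times at the cell level but only once at the clone level.

```python
def convergent_fractions(cells):
    """Return {cell_type: (cell_fraction, clone_fraction)} of convergence.

    cells: iterable of (cell_type, clone_id, is_convergent) tuples, one per
    cell (hypothetical format). All cells of a clone share the same TCR, so
    they carry the same is_convergent label.
    """
    by_type = {}
    for cell_type, clone_id, is_conv in cells:
        by_type.setdefault(cell_type, []).append((clone_id, is_conv))

    result = {}
    for cell_type, rows in by_type.items():
        # Cell level: every cell counts, so expanded clones dominate.
        cell_frac = sum(conv for _, conv in rows) / len(rows)
        # Clone level: collapse to one entry per clone_id.
        clones = dict(rows)
        clone_frac = sum(clones.values()) / len(clones)
        result[cell_type] = (cell_frac, clone_frac)
    return result
```

      In a toy input where one convergent CD8+ clone has expanded to four cells, the cell-level fraction is inflated relative to the clone-level fraction, which is the bias the clone-based recalculation is meant to remove.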

      (2) We next investigated the effect of cell expansion in the single-cell analysis. We agree with the reviewer that some highly expanded convergent clones could inflate the p-values. Therefore, we revised the calculation of TCR convergence by using T cell clones instead of individual cells. We observed that the clusters of interest mentioned in the paper (for both mouse and human data) remain at the top convergence level among all clusters (see table below), with p-values estimated using an exact binomial test. These results supported our hypothesis that TCR convergence is enriched in T cell clusters that are more likely to be antigen-experienced.
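      For the cluster-level enrichment, an exact binomial test asks how surprising it is to see at least k convergent clones among a cluster's n clones, given a background convergence rate p. The sketch below assumes a one-sided (enrichment) test with p taken as the overall clone-level convergence rate; the reply does not state the sidedness or the background rate, so both are assumptions. SciPy's `scipy.stats.binomtest(k, n, p, alternative='greater')` should give the same p-value.

```python
from math import comb

def binom_upper_tail(k, n, p):
    """Exact one-sided p-value P(X >= k) for X ~ Binomial(n, p).

    k: convergent clones observed in the cluster
    n: total clones in the cluster
    p: assumed background (e.g. overall) clone-level convergence rate
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))
```

      For instance, a cluster with 3 convergent clones out of 10, against a background rate of 0.05, gives `binom_upper_tail(3, 10, 0.05)` of about 0.0115.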

      4) TCR convergence is associated with the clinical outcome of ICB treatment: The associations for the first analysis are described as significant in the text, and they are, but just barely (0.045 and 0.047, but you have to check the figure to see that).

      As suggested by the reviewer, we have added the p-values to the text so that they are easier to see. In this revision, we adopted another definition of convergence level, changing from the ratio of convergent TCRs to the actual number of convergent T cell clones within each sample. The p-values were more significant using this new indicator (0.02 and 0.00038). To avoid the effect of other variables that might be correlated with convergence levels, especially sequencing depth, a multivariate Cox model was used for both datasets tested in the paper, correcting for TCR clonality, TCR diversity and sequencing depth (and treatment regimen for the melanoma data). Convergence remained significantly prognostic after adjusting for these additional variables.

      5) Introduction/Discussion: Overall, the authors could do a better job citing previous work on convergence, for example, papers from Venturi on convergent recombination and the work from Mora and Walczak (ALICE, another recombination modeling). They also present the use of convergence as an ICB biomarker as a novel finding, but Ref 5 introduces this concept and validates it in another cohort. Ref 5 also has a careful analysis of the link between sequencing errors and convergence, which could have been more carefully considered here.

      We thank the reviewer for this excellent suggestion. We have added the citation of Venturi on convergent recombination as Ref 43 and cited it in the last paragraph of the Results section:

      “Convergent recombination was claimed to be the mechanistic basis for public TCR responses in many previous studies (Quigley et al., 2010; Venturi et al., 2006).”

      We also included work from Mora and Walczak in the fourth paragraph of the introduction and the third paragraph of the discussion as Ref 27 to introduce this TCR similarity-based clustering method as well as its application in predicting ICB response:

      “This idea has led to the development of several TCR similarity-based clustering algorithms, such as ALICE (Pogorelyy et al., 2019), TCRdist (Dash et al., 2017), GLIPH2 (Huang et al., 2020), iSMART (Zhang et al., 2020), and GIANA (Zhang et al., 2021), for studying antigen-driven T cell expansion during viral infection or tumorigenesis.”

      “In addition, the potential prognostic value of TCR convergence and TCR similarity-based clustering has been tested in other studies (Looney et al., 2019; Pogorelyy et al., 2019).”

      Ref 5 is now cited when discussing the effect of sequencing errors on TCR convergence in the fourth paragraph of the Discussion:

      “Improper handling of sequencing errors may result in the overestimation of TCR convergence (Looney et al., 2019).”

    1. Author Response:

      We have now revised the manuscript to address the helpful comments and criticisms from the reviewers. The revised manuscript includes additional experiments demonstrating that inclusion of Csn2/Cas9 in the in vitro assays does not suppress the disintegration activity of Cas1-Cas2 to favor integration. These additional factors do not confer strand selectivity on integration either. Furthermore, the results of integration reactions using substrates mimicking PAM-containing pre-spacers have also been added.

      New figures and figure modifications at a glance:

      1) The new Figure 2 shows Cas1-Cas2 reactions in a linear target site and the effects of Csn2 and/or Cas9 on proto-spacer insertion into this target (Reviewer 1).

      The original Figure 2 (with slight modifications) is now moved to ‘Supplementary Data’ as Figure 2-figure supplement 2, and shows proto-spacer insertion by Cas1-Cas2 into a nicked linear target site (Reviewer 2). Figure 2 is the only one in the main set of figures that has been extensively modified.

      2) The new Figure 2-figure supplement 1 (under ‘Supplementary Data’) shows the effects of Csn2, Cas9 or both on proto-spacer integration-disintegration by Cas1-Cas2 when the target site is present in a supercoiled plasmid (Reviewer 1).

      3) The new Figure 4-figure supplement 1 lists the sequences of the full- and half-target sites used for the reactions shown in Figure 4 (Reviewer 2).

      4) The new Figure 2-figure supplement 3 shows the insertion properties of PAM-containing pre-spacer mimics in reactions with Cas1-Cas2 alone or supplemented with Csn2, Cas9 or both (Reviewer 1).

      5) The new Figure 6-figure supplement 1 gives a structural perspective of the trombone substrates used for the reactions shown in Figure 6B, C (Reviewer 1).

      6) The original Supplementary Figure S8 showing assays for PAM-specific cleavage by Cas1-Cas2 has been removed (Reviewer 1).

      7) There are no changes in the other figures under ‘Supplementary Data’, although several have new numbers consistent with the revisions made.

      Public Review (Reviewers #1 and #2):

      The present work is a critical extension of the in vitro biochemical activities of the Cas1-Cas2 complex described by Wright and Doudna (Nat Struct Mol Biol, 2016; 23: 876-883). We have kept all experimental conditions nearly identical to those used by these authors to make the results from the two studies directly comparable. Importantly, we now show that the prior model for proto-spacer integration into the CRISPR locus by Cas1-Cas2 is an oversimplification of a much more nuanced mechanism.

      While both reviewers recognize the importance of our findings in challenging the current thinking on the adaptation mechanism of CRISPR immunity, they express reservations as to whether the in vitro results recapitulate the in vivo mechanism of spacer acquisition. This seems to us to be too broad a criticism from which few (if any) biochemical experiments can be immune.

      Our key finding is that disintegration during the second step of proto-spacer integration generates a DNA structure that has all the hallmarks of a DNA damage intermediate that the bacterial repair machinery can readily process into an authentic integration product. We invoke no new or ad hoc mechanisms, and the model we propose fits neatly into the DNA gap-filling mechanisms known to operate in DNA transposition pathways.

      The proto-spacer is functionally a ‘micro-transposon’, whose shortness imposes severe torsional strain on the transposition intermediate that precedes the final integration product. In vitro experiments suggest that transcription is potentially capable of resolving this intermediate (Budhathoki et al., Nat Struct Mol Biol, 2020, 27: 489-99). In principle, replication can also accomplish this task. Our study now demonstrates that simply nicking the DNA (disintegration) is an equally effective solution for relieving the topological stress accompanying integration. DNA loose ends can then be readily tied up by the bacterial repair machinery.

      We concur with the concluding sentence of reviewer 2, “The simple conclusion that Cas1-Cas2 catalyzed hydrolysis of a phosphodiester may relieve strain and allow productive transposition to occur doesn’t get emphasized enough in my opinion.” We have now expanded on this point in the revised ‘Discussion’.

      Reviewer #1:

      In addition, the in vitro system used here is only partially reconstituted. The substrates lack a PAM sequence, which is necessary for protospacers to be incorporated in the correct orientation and may help direct the first integration event to the L-R junction. Presumably because of this, none of the reactions presented analyze the orientation of the incorporated prespacer sequence. Cas9 and Csn2 are also absent (as are other potentially required host factors), which are necessary for correct integration in vivo.

      1A. Strand specificity: The in vitro integration reactions with the Cas1-Cas2 complex were done using a protospacer of the optimal size (26 nt on each strand with the four 3’-proximal bases on each strand as unpaired). Either proto-spacer strand is equally competent to initiate the strand transfer reaction, as could be inferred from Figure 3 of the original submission. Here, reactions utilized modified proto-spacers that differed in their top and bottom strand lengths. They gave two insertion products (IP) each at the L-R (leader-repeat) and R-S (repeat-spacer) junctions of a normal target site. In modified targets in which integration was limited to just the L-R junction, two insertion products were formed. One panel of Figure 3 (which is retained in the revised manuscript) showing the four insertion products from the normal target (lane 10) and two from the modified targets (lanes 11-13) for a protospacer with 26 nt and 31 nt long strands is displayed below.

      The ability of either proto-spacer strand to initiate integration is now more directly shown in Figure 2 (new) of the revised manuscript. Here the labeled top or bottom strand of the proto-spacer (PS) gave insertion products (IP) at the L-R and R-S junctions of the target site. Panel B of Figure 2 (pasted below) demonstrates this result.

      1B. Cas9, Csn2 included reactions: The data for reactions containing Csn2 or Cas9 or both were not shown previously, as they did not alter Cas1-Cas2 activity by promoting strand specificity of integration or suppressing disintegration. These results are now shown in the revised Figure 2 (linear target) and the new Figure 2-figure supplement 1 (supercoiled target). Portions of these figures are shown below.

      The relevant revised text describing the lack of strand specificity to proto-spacer integration by Cas1-Cas2 and the Csn2/Cas9 effects on integration is pasted below.

      Page 15, lines 229-235.

      "Unlike orientation-specific proto-spacer integration in vivo, Cas1-Cas2 reactions in vitro showed no strand-specificity (Figure 2B). This bias-free insertion of the top or bottom strand from the proto-spacer was unchanged by the addition of Csn2 or Cas9 or both to the reactions (Figure 2C-E). These proteins, singly or in combination, also failed to stabilize proto-spacer integrations in the supercoiled plasmid target (Figure 2-figure supplement 1). Instead, they inhibited plasmid relaxation. Inhibition could occur at the level of integration per se or strand rotation during integration-disintegration."

      1C. PAM-containing substrates: We have now tested Cas1-Cas2 activity (with and without added Csn2 or Cas9 or both) on PAM-containing substrates that mimic ‘pre-spacers’, Figure 2-figure supplement 3 (new).

      In these substrates, a proto-spacer strand of the standard length (26 nt; lacking PAM or its complement) is inserted at the L-R junction with higher efficiency than the longer strand (containing PAM or its complement). Following the first integration at L-R, the pre-spacer mimics containing > 26 nt in one strand or both strands are inhibited in the second strand transfer to the R-S junction. A portion of Figure 2-figure supplement 3 illustrating these points is shown below.

      The revised ‘Results’ section has the following added description of the activities of PAM-containing pre-spacer mimics.

      Pages 16-19, lines 265-297. Cas1-Cas2 activity on pre-spacer mimics carrying the PAM sequence

      "The strand cleavage and strand transfer steps of proto-spacer insertion at the CRISPR locus must engender safeguards against self-targeting of the inserted spacer as well as its non-functional orientation. However, no strand selectivity is seen in the in vitro Cas1-Cas2 reactions with already processed proto-spacers lacking the PAM sequence (Figures 2 and 3). By coordinating PAM-specific cleavage of a pre-spacer with transfer of this cleaved strand to the L-R junction, the inserted spacer will be in the correct orientation to generate a functional crRNA. To examine this possibility, we tested the integration characteristics of pre-spacer mimics containing the PAM sequence.

      The inclusion of PAM or PAM and its complement in the integration substrates (Figure 2-figure supplement 3A) did not confer strand specificity on reactions with Cas1-Cas2 alone or with added Csn2, Cas9 or both (Figure 2-figure supplement 3B-E). Optimal integration by Cas1-Cas2 occurred with the 26 nt strands of the native protospacer with their 4 nt 3’-overhangs (Figure 2-figure supplement 3B-E; lanes 2). The pre-spacer mimics containing one or both > 26 nt strands had reduced integration competence (Figure 2-figure supplement 3B-E; lanes 4). Even here, the 26 nt strand with the 4 nt overhang (Figure 2-figure supplement 3C; lane 4) was preferred in integration over the longer 29 nt PAM-containing strand (Figure 2-figure supplement 3D; lane 4) or the 33 nt PAM complement-containing strand (Figure 2-figure supplement 3E; lane 4). In contrast to the processed proto-spacer that gave nearly equal integration at L-R and R-S, IP(L-R) ≈ IP(R-S) (Figure 2-figure supplement 3B-E; lanes 2), the longer pre-spacer mimics were inhibited in integration at R-S, IP(L-R) > IP(R-S) (Figure 2-figure supplement 3B-E; lanes 4). This is the expected outcome if the initial strand transfer occurs at L-R, and a ruler-like mechanism orients the reactive 3’-hydroxyl for the second strand transfer at R-S. This sequential two-step scheme for proto-spacer integration is consistent with the results shown in Figure 3 as well. These reaction features were not modulated by Csn2 or Cas9 (Figure 2-figure supplement 3B-E; lanes 6 and 8), although Csn2 plus Cas9 was inhibitory (Figure 2-figure supplement 3B-E; lanes 10).

      There is no evidence for integration accompanying PAM-specific cleavage in our in vitro reactions. In the E. coli CRISPR system, Cas1-Cas2 is apparently sufficient for PAM-specific cleavage in vitro (22). By contrast, in the S. pyogenes system, cleavage is attributed to Cas9 or as yet uncharacterized bacterial nuclease(s) (35). The mechanism for generating an integration-proficient and orientation-specific proto-spacer, which may not be conserved among CRISPR systems, is poorly understood at this time."

    1. Author Response

      Reviewer #1 (Public Review):

      Kazrin appears to be implicated in many diverse cellular functions, and accordingly, localizes to many subcellular sites. Exactly what it does is unclear. The authors perform a fairly detailed analysis of Kazrin in-cell function, and find that it is important for the perinuclear localization of TfN, and that it binds to members of the AP-1 complex (e.g., gamma-adaptin). The authors note that the C-terminus of Kazrin (which is predicted to be intrinsically disordered) forms punctate structures in the cytoplasm that colocalize with components of the endosomal machinery. Finally, the authors employ co-immunoprecipitation assays to show that both N and C-termini of Kazrin interacts with dynactin, and the dynein light-intermediate chain.

      Much of the data presented in the manuscript are of fairly high quality and describe a potentially novel function for Kazrin C. However, I had a few issues with some of the language used throughout, the manner of data presentation, and some of their interpretations. Most notably, I think in its current form, the manuscript does not strongly support the authors' main conclusion: that Kazrin is a dynein-dynactin adaptor, as stated in their title. Without more direct support for this function, the authors need to soften their language. Specific points are listed below.

      Major comments:

      1) I agree with the authors that the data provided in the manuscript suggest that Kazrin may indeed be an endosomal adaptor for dynein-dynactin. However, without more direct evidence to support this notion, the authors need to soften their language stating as much. For example, the title as stated would need to be changed, as would much of the language in the first paragraph of the discussion. Alternatively, the manuscript could be significantly strengthened if the authors performed a more direct assay to test this idea. For example, the authors could use methods employed previously (e.g., McKenney et al., Science 2014) to this end. In brief, the authors can simply use their recombinant Kazrin C (with a GFP) to pull out dynein-dynactin from cell extracts and perform single molecule assays as previously described.

      While this is certainly an excellent suggestion, in vitro dynein/dynactin motility assays are not straightforward experiments for laboratories that do not use them routinely. We therefore asked Dr. Thomas Surrey (Centre for Genomic Regulation, Barcelona), an expert in the biochemistry and biophysics of microtubule dynamics, to help us with this kind of analysis. In their setting, TIRF microscopy is used to follow EGFP-dynein/dynactin motility along microtubules immobilized on cover slides (Jha et al., 2017). As shown in figure R1, more binding of EGFP-dynein to the microtubules is observed when purified kazrin is added to the assay (from 20 to 400 nM), but there is no increase in the number or processivity of the EGFP-dynein motility events. These results are hard to interpret at this point. Kazrin might still be an activating adaptor, with a component missing from the assay (e.g., an activating posttranslational modification or a particular subunit of the dynein or dynactin complexes), or it could increase the processivity of dynein-dynactin in complex with another bona fide activating adaptor, as has been demonstrated for LIS1 (Baumbach et al., 2017; Gutierrez et al., 2017). Alternatively, kazrin could transport dynactin and/or dynein to the microtubule plus ends in a kinesin-1-dependent manner, in order to load the peripheral endosomes with the minus-end-directed motor (Yamada et al., 2008).

      Figure R1. Kazrin C purified from E. coli increases binding of dynein to microtubules but does not increase the number or processivity of EGFP-dynein motility events. A. TIRF (Total Internal Reflection Fluorescence) micrographs of microtubule-coated cover slides incubated with 10 nM EGFP-dynein and 20 nM dynactin in the presence or absence of 20 nM kazrin C, expressed and purified from E. coli. B. Kymographs of TIRF movies of microtubule-coated cover slides incubated with purified 10 nM EGFP-dynein, 20 nM dynactin and either 400 nM of the activating adaptor BICD2 (1:2:40 ratio) (left panel) or kazrin C (right panel). Red squares indicate processive dynein motility events induced by BICD2.

      Investigating the molecular activity of kazrin on dynein/dynactin motility is a whole project in itself that we feel is beyond the scope of the present manuscript. Therefore, as suggested by the BRE, we have chosen to soften the conclusions and classify kazrin as a putative “candidate” dynein/dynactin adaptor based on its interactome, domain organization and subcellular localization, as well as on the defects in endosome motility observed in vivo upon its depletion. We also discuss other possibilities, such as those outlined above.

      2) I'm not sure I agree with the use of the term 'condensates' used throughout the manuscript to describe the cytoplasmic Kazrin foci. 'Condensates' is a very specific term that is used to describe membraneless organelles. Given the presumed association of Kazrin with membrane-bound compartments, I think it's more reasonable to assume these foci are quite distinct from condensates.

      We actually used the term condensates to avoid implying that the kazrin IDR generates membraneless compartments or induces liquid-liquid phase separation, which is certainly not a conclusion of the manuscript. However, since all reviewers agreed that the word was misleading, we have replaced the term condensates with foci throughout the manuscript.

      3) The authors note the localization of Tfn as perinuclear. Although I agree the localization pattern in the kazKO cells is indeed distinct, it does not appear perinuclear to me. It might be useful to stain for a centrosomal marker (such as pericentrin, used in Figure 5B) to assess Tfn/EEA1 with respect to MT minus ends.

      We have now replaced the term perinuclear, which implies that endosomes surround the nucleus, with the term juxtanuclear, which more accurately defines what we wanted to indicate (close to the nucleus). We thank the reviewer for pointing out this inaccuracy. We also describe more clearly in the text that in fibroblasts, the Golgi apparatus and the Recycling Endosomes (REs) gather around the pericentriolar region (Granger et al., 2014, and references therein), which is usually close to the nucleus (Tang and Marshall, 2012, and references therein). Nevertheless, as suggested by the reviewer, we have included pictures of the TxR-Tfn and EEA1-labelled endosomes accumulating around pericentrin in wild type mouse embryonic fibroblasts (MEF) (Figure 1–supplement figure 3) to illustrate these points.

      4) "Treatment with the microtubule depolymerizing drug nocodazole disrupted the perinuclear localization of GFP-kazrin C, as well as the concomitant perinuclear accumulation of EE (Fig. 5C & D), indicating that EEs and GFP-kazrin C localization at the pericentrosomal region required minus end-directed microtubule-dependent transport, mostly affected by the dynactin/dynein complex (Flores-Rodriguez et al., 2011)."

      • I don't agree that the nocodazole experiment indicates that minus end-directed motility is required for this perinuclear localization. In the absence of other experiments, it simply indicates that microtubules are required. It might, however, "suggest" the involvement of dynein. The same is true for the subsequent sentence ("Our observations indicated that kazrin C can be transported in and out of the pericentriolar region along microtubule tracks...").

      We agree with the reviewer. To reinforce the point that GFP-kazrin C localization and the pericentriolar accumulation of EEA1 rely on dynein-dependent transport, we have now added an experiment in Figure 5E and F, where we use ciliobrevin to inhibit dynein in cells expressing GFP-kazrin C. In the treated cells, GFP-kazrin C staining at the pericentrin foci is lost and EEs have a more dispersed distribution, similar to kazKO MEF. We have also completed and rearranged the in vivo fluorescence microscopy data to show more clearly that small GFP-kazrin C foci can be observed moving towards the cell centre (Figure 5-S1 and movies 6 and 7). Taking all these data together, we think we can now suggest that kazrin might travel into the pericentriolar region, possibly along microtubules and powered by dynein.

      5) Although I see a few examples of directed motion of Tfn foci in the supplemental movies, it would be more useful to see the kymographs used for quantitation (and noted by the authors on line 272). Also related to this analysis, by "centripetal trajectories", I assume the authors are referring to those moving in a retrograde manner. If so, it would be more consistent with common vernacular (and thus more clear to readers) to use 'retrograde' transport.

      We have now included more examples of the time projections used in the analysis in Figures 6-S1 and 6-S2, where we have coloured in blue the fairly straight, longer trajectories, as opposed to the more confined movements that appear as round dots in the time projections (coloured in red). We have also added more videos illustrating the differences observed in cells expressing endogenous or GFP-kazrin C versus kazKO cells or kazKO cells expressing GFP or GFP-kazrin C-Nt. Movies 8 and 11 show endosome motility in representative WT and kazKO cells (movie 8) and kazKO cells expressing GFP, GFP-kazrin C or GFP-kazrin C-Nt (movie 11). Movies 9 and 10 show endosome motility in four magnified fields of different WT and kazKO cells, where longer and faster motility events can be observed when endogenous kazrin is expressed. Movies 12 to 14 show endosome motility in four magnified fields of different kazKO cells expressing GFP-kazrin C (movie 12), GFP (movie 13) and GFP-kazrin C-Nt (movie 14). Longer and faster movements can be observed in the different insets of movie 12, as compared with movies 13 and 14. Finally, as suggested by the reviewer, we have replaced centripetal movement with retrograde movement throughout the manuscript.

      6) The error bars on most of the plots appear to be extremely small, especially in light of the accompanying data used for quantitation. The authors state that they used SEM instead of SD, but their reasoning is not stated. The former simply leads to an artificial reduction in the real deviation of the data (by dividing the SD by the square root of whatever they define as 'n', which isn't clear to me), which I find misleading and very unrepresentative of biological data. For example, the error bars for cell migration speed in Figure 2B suggest that the speeds for WT cells ranged from ~1.7-1.9 µm/sec, which I assume largely underrepresents the range of values. Although I'm not a statistician, as someone who studies biochemical and biological processes, I strongly urge the authors to use plots and error bars that more accurately describe the data to their readers (e.g., scatter plots with standard deviation are the most transparent way to display data).

      We have now changed all plots to scatter plots with standard deviations, as suggested.
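      To make the SD-versus-SEM point concrete, here is a minimal Python sketch of the relationship the reviewer describes, SEM = SD/√n. The function name and the speed values are hypothetical and only illustrate the arithmetic; they are not data from the manuscript.

```python
import math
import statistics


def sd_and_sem(values):
    """Return (sample SD, SEM) for a list of replicate measurements."""
    sd = statistics.stdev(values)      # sample SD, n - 1 denominator
    sem = sd / math.sqrt(len(values))  # SEM = SD / sqrt(n)
    return sd, sem


# Hypothetical per-cell values, for illustration only (not manuscript data)
speeds = [1.2, 1.9, 2.4, 1.5, 2.1, 1.7, 2.6, 1.4, 1.8, 2.0]
sd, sem = sd_and_sem(speeds)
print(f"n={len(speeds)} mean={statistics.mean(speeds):.2f} SD={sd:.2f} SEM={sem:.2f}")
```

      Because SEM shrinks as 1/√n, it describes the precision of the mean rather than the spread of individual cells. That is why SD with the individual points shown conveys biological variability more transparently.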

    1. Author Response

      Reviewer #2 (Public Review):

      Wang et al. elegantly exploit single-cell RNA-seq datasets to question the putative involvement of lncRNAs in human germ cell development. In the first part of the study, the authors use computational approaches to identify and characterize, from existing data, lncRNAs expressed in the germline. Of note, the scRNA-seq data used were generated from polyA+ RNAs, and thus non-polyadenylated lncRNAs could not be retrieved. Most of the lncRNAs identified in the germ cells and in the somatic cells of the gonads were previously unannotated. While this increases the catalog of lncRNA genes in the human genome, further characterization is needed to determine which fraction of these newly identified lncRNAs represent bona fide transcripts or transcriptional noise.

      Differential expression analysis between developmental stages, sexes, or cell types led to several observations: (i) whatever the stage of development, the number of expressed lncRNAs is higher in fetal germ cells than in gonadal somatic cells; (ii) there is a continuous increase in the number of expressed lncRNAs during the development of the germline; of note, a similar, although more subtle, trend is observed for protein-coding genes; (iii) the developmental stage at which the highest number of lncRNAs is expressed differs between male and female germ cells. While convincing, the significance of these observations is difficult to assess. However, the authors remain prudent with their conclusions and do not over-interpret their findings.

      We appreciate Reviewer #2's precise summary of our analysis and thank the reviewer for highlighting the significance of these datasets for other researchers and future studies.

      Interestingly, integrating lncRNA expression to classify cell types led to the identification of a novel population of cells in the female germline that had not been revealed by classification based on protein-coding genes only. The biological relevance of this population, which clusters with mitotic populations, remains to be demonstrated. Finally, by examining lncRNA biotype, the authors could demonstrate an enrichment, in the germ cells, of the antisense head-to-head organization (in relation to the nearby protein-coding gene) compared to other biotypes. Whether this differs from the general distribution of lncRNAs should be discussed.

      We analyzed the lncRNAs in the NONCODEv5 database (human genome), and the results showed that the XH type accounted for 21.73% of the intragenic lncRNA-mRNA pairs in NONCODEv5, which is lower than the 26.58% in fGC and 26.23% in mGC (Response Figure 1).

      Response Figure 1. Genomic distribution and biotypes of the lncRNAs in NONCODEv5 database and lncRNAs expressed in human gonad.

      In the second part of the manuscript, Wang et al. focus on one pair of divergent lncRNA-protein coding genes (LNC1845-LHX8). To justify the choice of this particular pair, it would be informative to have its correlation score indicated in Figure 3C. The existence of this transcript was validated using female fetal ovaries, and its function was addressed in late primordial germ cell-like cells (PGCLCs) derived from human embryonic stem cells (hESCs). The authors have used an admirable set of orthogonal approaches that led them to conclude that LNC1845 regulates the nearby gene LHX8 in cis. They further went on to identify the underlying mechanisms, which involve modification of the chromatin landscape through direct interaction of LNC1845 with a histone modifier. Among the different strategies used (KO, stop transcription, overexpression), shRNA-mediated knock-down is the only one that specifically addresses the function of the transcript itself, as opposed to active transcription. The result of this experiment led the authors to conclude that the LNC1845 RNA is functional, a conclusion reinforced by the demonstration of physical interaction between the LNC1845 RNA and WDR5, a component of MLL methyltransferase complexes. The result of the KD experiment is, however, puzzling, as RNAi has been shown not to be the method of choice for targeting nuclear lncRNAs (Lennox et al. NAR 2016).

      We thank Reviewer #2 for the suggestion to add the correlation score of the LNC1845-LHX8 pair; the Pearson correlation coefficient of this pair is 0.3268. We have added this value to Figure 4C, where the expression correlation of LNC1845 and LHX8 is first mentioned. Many similar studies have used shRNA knockdown to target nuclear lncRNAs (Guttman et al. Nature 2011; Luo et al. Cell Stem Cell 2016; Subhash et al. Nucleic Acids Res. 2018; Li et al. Genome Res 2021), and the knockdown efficiency we obtained appeared acceptable. The knockdown results are consistent with the deletion mutation and stop transcription approaches; all three showed that LNC1845 transcriptional expression is required for proper LHX8 expression in late PGCLCs.

      Overall, the functional investigation is convincing and strengthened by the inclusion of multiple clones for each approach, and by the convergence in the outcome of each individual approach. The depth of characterization is also remarkable. The analyses of the mechanisms at stake are somewhat less solid, as there is less evidence demonstrating the involvement of the LNC1845 RNA and its interaction with WDR5.

      We have added more experimental evidence to strengthen the model, especially the interaction of LNC1845 and WDR5. Apart from the RIP-qPCR results demonstrating enrichment of LNC1845 in the WDR5 pulldown (Figure S8D), we performed a chromatin isolation by RNA purification (ChIRP) assay using antisense oligos along the entire LNC1845 transcript sequence. ChIRP results confirmed that WDR5 protein was enriched when anti-LNC1845 oligo probes were used to isolate the complex, but not in the controls without probes or without overexpression of the LNC1845 transcript (Response Figure 2). Taken together, the findings of both approaches support the model that LNC1845 directly interacts with WDR5 to modulate H3K4me3 modification for LHX8 transcriptional activation (related to Supplementary Figures 8D and 8E).

      Response Figure 2. LNC1845 binding to WDR5 was verified by ChIRP-western blot.

      Altogether, this study provides a convincing demonstration of the role of a lncRNA on the regulation of a nearby gene in the context of the germline. However, to have a better understanding of the functionality of lncRNA genes in general, it would be interesting to know whether other pairs of lncRNA-PC genes have been functionally investigated in this context, where no function for the lncRNA gene could be demonstrated. Negative results are highly informative and if so, these could be included in the manuscript.

      We appreciate Reviewer #2's suggestion to add results for other lncRNA-PC gene pairs. In fact, we have analyzed and presented the results for two other pairs in Figure 7D. LncRNAs LNC3346 and LNC15266 were also transcriptionally regulated by FOXP3, and they may regulate their neighboring genes TMCO1 and MPP5, as shown in Figure 7D. Our analysis suggests that other lncRNA-PC gene pairs may undergo transcriptional regulation similar to that of LNC1845-LHX8 during germ cell development.

    1. Author Response

      Reviewer #2 (Public Review):

      Charme is a long non-coding RNA reported by the authors in their previous studies. Their previous work, mainly using skeletal muscles as a model, showed the functional relevance of Charme, and presented data demonstrating its nuclear role, primarily via modulating the sub-nuclear localization of Matrin 3 (MATR3). Their data from skeletal muscles suggested that loss of the intronic region of Charme affects the local 3D genome organization, affecting MATR3 occupancy and thus gene expression. Loss of Charme in vivo leads to cardiac defects. In this manuscript, they characterize the cardiac developmental defects and present molecular data supporting how the loss of Charme affects the cardiac transcriptome repertoire. Specifically, by performing whole transcriptome analysis in E12.5 hearts, they identify gene expression changes in developing hearts due to loss of Charme. Based on their previous study in skeletal muscles, they assume that Charme regulates cardiac gene expression primarily via MATR3 also in developing cardiomyocytes. They provide CLIP-seq data for MATR3 (transcriptome-wide footprinting of MATR3) in wild-type E15.5 hearts and connect the binding of MATR3 to gene expression changes observed in Charme knockout hearts. I credit the authors for providing CLIP-seq data from in vivo embryonic samples, which is technically demanding.

      Major strengths:

      As previously indicated by the authors in Charme knockout mice, the major strength is the effect of Charme on cardiac development. While the phenotype might be subtle, the functional data indicate that Charme is essential for cardiac development and function. The combinatorial analysis of MATR3 CLIP-seq and transcriptional changes in the absence of Charme suggests a role of Charme that could be dependent on MATR3.

      We thank this reviewer for appreciating our methodological efforts and the importance of the MATR3 CLIP-seq data from in vivo embryonic samples.

      Weakness:

      (i) Nuclear lncRNAs often affect local gene expression by influencing the local chromatin.

      The Charme locus is in close proximity to MYBPC2, which is essential for cardiac function, sarcomerogenesis, and sarcomere maintenance. It is important to rule out that the cardiac-specific developmental defects due to Charme loss arise from (a) the influence of Charme on MYBPC2 or, for that matter, other neighboring genes, or (b) local chromatin changes or altered enhancer-promoter contacts of MYBPC2 and other immediate neighbors (both aspects in the developmental time window when Charme expression is prominent in the heart, ideally from E11 to E15.5).

      Although cis-activity represents a mechanism of action for several lncRNAs, our previous work does not reveal this kind of activity for pCharme. To add stronger evidence, we have now analysed the expression of pCharme neighbouring genes in cardiac muscle. Genes were selected by considering not only those in “linear” proximity but also those with chromatin contacts, which may point to candidates for in cis regulation. To this purpose, we made use of analyses of available Hi-C datasets (Rosa-Garrido et al., 2017) that were in progress in the meantime (to answer point iv). Starting from a 1 Mb region around the Charme locus, we found that most of the interactions with Charme occur in a region spanning from 240 kb upstream to 115 kb downstream of Charme, for a total of 370 kb (Rev#2_Capture Fig. 1A). This region includes 39 genes, 9 of which are expressed in the neonatal heart, none showing significant deregulation (see Table S2). Notably, this genomic region also includes the MYBPC2 locus, for which we did not find decreased expression in the heart in our RNA-seq data (Revised Figure 2-figure supplement 1C and Table S2). This trend was confirmed by RT-qPCR analyses of several genes in E15.5 extracts, which revealed no significant difference in their abundance upon Charme ablation (Rev#2_Capture fig. 1B).

      Fig. 1. A) Contact map depicting Hi-C data from mouse left ventricle retrieved from GEO accession ID GSM2544836. Data for the 1 Mb region around the Charme locus were visualized using the Juicebox Web App (https://aidenlab.org/juicebox/). B) RT-qPCR quantification of Charme and its neighbouring genes in CharmeWT vs CharmeKO E15.5 hearts. Data were normalized to GAPDH mRNA and represent means ± SEM of WT and KO (n=3) pools. Data information: *p < 0.05; **p < 0.01; ***p < 0.001, unpaired Student’s t test.

      For a better understanding, we also checked possible “local” Charme activities in skeletal muscle cells, from previous datasets (Ballarino et al., 2018). We found that in murine C2C12 cells treated with two different gapmers against Charme, three of its neighbouring genes were expressed (Josd2, Emc10 and Pold1), but none showed significant alterations in their expression levels in response to Charme knock-down (Rev#2_Capture Fig. 2).

      Taken together, these results argue against in cis activity of Charme as the cause of the phenotype.

      Fig. 2: Average expression from RNA-seq (FPKM) quantification of Charme neighbouring genes in C2C12 differentiated myotubes treated with Gap-scr vs Gap-Charme. Values for Gap-Charme represent the average values of gene expression after treatment with two different gapmers (GAP-2 and GAP-2/3).

      (ii) The authors provide data indicating cardiac developmental defects in Charme knockouts. Detailed developmental phenotyping is missing, which is necessary to pinpoint the exact developmental milestones affected by Charme. This is critical when reporting the cell type/ organ-specific developmental function of a newly identified regulator.

      We did our best to answer this concern.

      Let us first emphasise that, since the generation of the CharmeKO animals, we have never observed any tissue alteration, morphological or physiological, other than in muscle when dissecting them. The high specificity of pCharme expression, as also shown here by ISH (Figure 1C-D, Figure 1-figure supplement 1A-B, Figure 3A), together with the minimal alteration applied to the locus for CRISPR-Cas-mediated KO (PolyA insertion), strongly argues against alterations in other tissues and their involvement in the development of the phenotype.

      Nevertheless, we now add more developmental details to the cardiac phenotype (see also Essential revision point 2).

      1- First of all, gene expression analyses performed at E12.5, E15.5, E18.5 and neonatal (PN2) stages allowed us to identify, at the molecular level, the developmental time point at which CharmeKO effects on the cardiac muscle can be detected. Our new results clearly indicate that the pCharme-mediated regulation of morphogenic and cardiac differentiation genes is detectable from the E15.5 fetal stage onward (Rev#2_Capture Fig. 3/Revised Figure 2E). Together with the analysis of pCharme targets, and consistent with the altered cardiac maturation and performance, this evidence is also supported by the analysis of the Myh6/Myh7 myosin ratio, which decreases in CharmeKO hearts from E15.5 onward, reaching 69% of control levels at PN stages (Revised Figure 2F).

      2- Hematoxylin-eosin staining of dorso-ventral cryosections from CharmeWT and CharmeKO hearts confirmed the fetal malformation at the E15.5 stage (Revised Figure 2G). Moreover, the hypotrabeculation phenotype of CharmeKO hearts, initially examined by immunofluorescence, is now confirmed by the analysis of key trabecular markers (Irx3 and Sema3a), whose expression significantly decreases upon pCharme ablation (Rev#1_Capture Fig. 3B/Revised Figure 2-figure supplement 1G).

      3- Finally, the gene expression analysis of Ki-67, Birc5 and Ccna2 (Revised Figure 2-figure supplement 1E) rules out an influence of pCharme ablation on cell-cycle genes and cardiomyocyte proliferation, allowing a more careful interpretation of the embryonic phenotype. Note that, consistent with the lncRNA acting at later stages of development, the expression of important cardiac regulators, such as Gata4, Nkx2-5 and Tbx5, is not altered by its ablation at any of the tested time points (Rev#2_Capture Fig. 3), while pCharme absence mainly affects genes expressed downstream of these factors.

      These new results have been included in the revised version of the manuscript and are discussed in more detail.

      Fig. 3: RT-qPCR quantification of Gata4, Nkx2-5 and Tbx5 in CharmeWT and CharmeKO cardiac extracts at E12.5, E15.5 and E18.5 of embryonic development. Data were normalized to GAPDH mRNA and represent means ± SEM of WT and KO (n=3) pools.

      (iii) Along the same line, at the molecular level, the authors provide evidence indicating a change in the expression of genes involved in cardiogenesis and cardiac function. Based on changes in mRNA levels of the genes affected due to loss of Charme and based on immunofluorescence analysis of a handful of markers, they propose a role of Charme in cell cycle and maturation. Such claims could be toned down or warrant detailed experimental validation.

      See above, response to Reviewer #2 (Public Review) weakness (ii).

      (iv) The authors extrapolate the mechanistic findings they reported for Charme in skeletal muscle to the developing heart. While the data support this hypothesis, the study falls short of extending the mechanistic understanding of Charme beyond the papers previously published by the authors. CLIP-seq data is a step in the right direction. MATR3 is a relatively abundant RBP that binds transcriptome-wide, mainly in intronic regions, based on currently available CLIP-seq data as well as the authors' own CLIP-seq in cardiomyocytes. It has also been shown to regulate pre-mRNA splicing/alternative splicing along with PTB (PMID: 25599992) and 3D genome organization (PMID: 34716321). In addition, the authors propose a MATR3-dependent molecular function for Charme that relies primarily on the intronic region of Charme and on the binding of MATR3. Answering the following questions would enable a better mechanistic understanding of how Charme controls cardiac development.

      (i) What are the proximal genomic regions in 3D space to the Charme locus in embryonic cardiomyocytes? The authors could re-analyze published Hi-C data sets from embryonic cardiomyocytes or perform a 4C experiment using the Charme locus for this purpose.

      See above, response to Reviewer #2 (Public Review) weakness (i).

      (ii) does the loss of Charme affect the splicing landscape of MATR3 bound pre-mRNAs in E12.5 ventricles in general and those arising from the NCTC region specifically?

      This is an intriguing issue, as also highlighted by new evidence showing that the reactivation of fetal-specific RNA-binding proteins, including MATR3, in the injured heart drives transcriptome-wide switches through the regulation of early steps of RNA transcription and processing (D'Antonio et al., 2022).

      Using the rMATS software on our neonatal RNA-seq datasets, we then investigated the effect of pCharme depletion on splicing, with a focus on NCTC. As shown in Rev#2_Capture Fig. 4A, all classical splicing alterations were investigated, such as exon skipping, alternative 5’ splice site, alternative 3’ splice site, mutually exclusive exons and intron retention. Intriguingly, we did observe a slight alteration in the splicing patterns, in particular in exon skipping events (62%, corresponding to 381 genes). Among them, the majority corresponded to exon exclusion events (237 events in 209 genes), while a smaller fraction corresponded to exon inclusion (144 events in 133 genes). Moreover, by intersecting these genes with the MATR3-bound RNAs, we found a modest but significant enrichment (p = 0.038) for exon inclusion (Rev#2_Capture Fig. 4B).
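      The exon skipping events above are scored by percent spliced in (PSI). As a simplified sketch, assuming the length-normalised definition rMATS uses for exon skipping (inclusion reads divided by the effective inclusion length, over the sum of that term and the analogous skipping term), PSI and its difference between genotypes could be computed as follows. The function name, read counts and lengths are hypothetical. rMATS itself derives effective lengths from read and junction geometry and models replicate variability, which this sketch omits.

```python
def exon_psi(inc_reads, skip_reads, inc_len, skip_len):
    """Length-normalised percent spliced in (PSI) for one skipped-exon event.

    inc_reads / skip_reads: reads supporting exon inclusion / skipping.
    inc_len / skip_len: effective lengths of the inclusion / skipping forms,
    which correct for the inclusion form offering more positions for reads.
    """
    inc = inc_reads / inc_len
    skip = skip_reads / skip_len
    if inc + skip == 0:
        return float("nan")  # no informative reads for this event
    return inc / (inc + skip)


# Hypothetical counts for one event in each genotype (not manuscript data)
psi_wt = exon_psi(inc_reads=80, skip_reads=20, inc_len=2, skip_len=1)
psi_ko = exon_psi(inc_reads=40, skip_reads=40, inc_len=2, skip_len=1)
delta_psi = psi_ko - psi_wt  # negative here: exon excluded more often in KO
print(f"PSI WT={psi_wt:.2f} KO={psi_ko:.2f} dPSI={delta_psi:.2f}")
```

      Here the difference is taken as KO minus WT, so a negative value marks exon exclusion. In actual rMATS output, the sign depends on which sample group is supplied first.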

      Regarding the NCTC locus, we demonstrate that in hearts pCharme acts through different target genes. Indeed, none of the NCTC-arising transcripts are bound by MATR3 (see Table S4) or substrate for alternative splicing regulation.

      While these results are very interesting for deepening the investigation of the pCharme/MATR3 interplay, their biological significance needs to be further investigated through one-by-one analysis of specific transcripts. As a continuation of the project, Nanopore sequencing of these samples on a MinION platform is currently under way in the lab to obtain a better characterization of alternative splicing events in response to the lncRNA ablation during development.

      Fig. 4: A) Left and middle panels: Pie charts depicting the proportion of significantly altered (FDR < 0.05) splicing events detected by rMATS comparing neonatal CharmeWT and CharmeKO RNA-seq samples. All classical splicing alterations were investigated, such as exon skipping, alternative 3’ splice site (A3SS), intron retention, alternative 5’ splice site (A5SS) and mutually exclusive exons (MXE). Right panel: Volcano plot depicting significant exon skipping events in CharmeKO (FDR < 0.05, PSI<0 for excluded and included exons, FDR >= 0.05 for invariant exons). The x-axis represents the exon inclusion ratio, or Percent Spliced In (PSI), while the y-axis represents the –log10 p-value. B) Pie charts representing the fraction of transcripts with at least one significant excluded (left panel), invariant (middle panel) or included (right panel) exon that are bound by MATR3. P-values of MATR3 target enrichment for each comparison are shown below. Statistical significance was assessed with Fisher's exact test.
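      The caption states that the enrichment of MATR3 targets was assessed with Fisher's exact test. Below is a minimal standard-library sketch of the one-sided version. It assumes a 2×2 layout of MATR3-bound versus unbound transcripts, split between one exon class and the background. The layout, the one-sided alternative, the function name and the counts are all assumptions, not the authors' exact setup. In practice, `scipy.stats.fisher_exact` with `alternative='greater'` returns the same one-sided p-value.

```python
from math import comb


def fisher_enrichment_p(a, b, c, d):
    """One-sided (greater) Fisher's exact test on a 2x2 table.

                      bound   not bound
        tested set      a         b
        background      c         d

    Returns P(X >= a) under the hypergeometric null with all margins fixed.
    """
    set_size = a + b      # transcripts in the tested exon class
    bound_total = a + c   # transcripts bound by the protein overall
    n = a + b + c + d     # all transcripts considered
    denom = comb(n, set_size)
    p = 0.0
    for x in range(a, min(set_size, bound_total) + 1):
        # ways to draw x bound and (set_size - x) unbound transcripts
        p += comb(bound_total, x) * comb(n - bound_total, set_size - x) / denom
    return p


# Hypothetical counts, for illustration only (not manuscript data)
print(f"p = {fisher_enrichment_p(30, 70, 100, 800):.3g}")
```

      Conditioning on all four margins is what makes the test exact. The p-value is simply the hypergeometric tail at or beyond the observed overlap.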

      (iii) MATR3 binds DNA, as also shown by authors in previous studies. Is the MATR3 genomic binding altered by Charme loss in cardiomyocytes globally, as well as on the loci differentially expressed in Charme knockout heart? Overlapping MATR3 genomic binding changes and transcriptome binding changes to differentially expressed genes in the absence of Charme would better clarify the MATR3-centric mechanisms proposed here. Further connecting that to 3D genome changes due to Charme loss could provide needed clarity to the mechanistic model proposed here.

      Previous experience from our lab (Desideri et al., 2020) and others (Zeitz et al., 2009, J Cell Biochem) indicates that chromatin immunoprecipitation (ChIP) is not the most suitable approach for identifying specific MATR3 targets, because of the broad distribution of MATR3 over the genome. Given the number of animals that would need to be sacrificed, we instead strengthened our MATR3 CLIP evidence by adding (i) a CharmeKO MATR3 CLIP-seq control and (ii) a combinatorial analysis of the MATR3 CLIP-seq and RNA-seq data.

      We have better explained the reasoning within the text, which now reads “The known ability of MATR3 to interact with both DNA and RNA and the high retention of pCharme on the chromatin may predict the presence of chromatin and/or specific transcripts within these MATR3-enriched condensates. In skeletal muscle cells, we have previously observed on a genome-wide scale, a global reduction of MATR3 chromatin binding in the absence of pCharme (Desideri et al., 2020). Nevertheless, the broad distribution of the protein over the genome made the identification of specific targets through MATR3-ChIP challenging.” (lines 274-279).

      Indeed, we found that MATR3 binding was significantly decreased on numerous peaks (434/626), while it increased on a smaller fraction of regions (192/626) (Revised Figure 5C). As a control, we performed MATR3 motif enrichment analysis on the differentially bound regions, which showed that the motif lies close to the peak summit (+/- 50 nt) (Revised Figure 5-figure supplement 1D), where MATR3 enrichment is strongest, further confirming direct and highly specific binding of the protein to these sites. To better characterise the relationship between MATR3 and pCharme, we then intersected the newly identified regions with the MATR3-bound transcripts whose expression was altered by Charme depletion. While gain peaks were equally distributed across DEGs, loss peaks were significantly enriched in a subset of pCharme-downregulated DEGs (Revised Figure 5D), suggesting crosstalk between the lncRNA and the protein in regulating the expression of this specific group of genes. Interestingly, these RNAs mainly fall into the same GO categories as pCharme-downregulated DEGs and include genes involved in embryo development, such as Cacna1c, Notch3, Myo18B and Rbm20, which were validated as pCharme/Matr3 targets in primary cardiac cells (Revised Figure 5D, lower panel, and 5E).

    1. Author Response

      Reviewer #1 (Public Review):

      The role of the parietal (PPC), the retrosplenial (RSP) and the visual cortex (S1) was assessed in three tasks corresponding to a simple visual discrimination task, a working-memory task and a two-armed bandit task, all based on the same sensory-motor requirements within a virtual reality framework. A differential involvement of these areas was reported in these tasks based on the effect of optogenetic manipulations. Photoinhibition of PPC and RSP was more detrimental than photoinhibition of S1, and more drastic effects were observed in presumably more complex tasks (i.e. working-memory and bandit task). If mice were trained with these more complex tasks prior to training in the simple discrimination task, then the same manipulations produced large deficits, suggesting that switching from one task to the other was more challenging, resulting in the involvement of possibly larger neural circuits, especially at the cortical level. Calcium imaging also supported this view, with differential signaling in these cortical areas depending on the task considered and the order in which they were presented to the animals. Overall the study is interesting and the fact that all tasks were assessed relying on the same sensory-motor requirements is a plus, but the theoretical foundations of the study seem a bit loose, opening the way to alternate ways of interpreting the data than "training history".

      1) Theoretical framework:

      The three tasks used by the authors should be better described at the theoretical level. While the simple task can indeed be considered a visual discrimination task, the other two tasks operationally correspond to a working-memory task (i.e. delay condition, which is typically assessed in a Y- or a T-maze in rodents) and a two-armed bandit task (i.e. the switching task), respectively. These three tasks are therefore qualitatively different and rely on at least partially dissociable neural circuits, and this should be clearly analyzed to explain the rationale for focusing on the three cortical regions of interest.

      We are glad to see that the reviewer finds our study interesting overall and sees value in the experimental design. We agree that in the previous version, we did not provide enough motivation for the specific tasks we employed and the cortical areas studied.

      Navigating to reward locations based on sensory cues is a behavior that is crucial for survival and amenable to a head-fixed laboratory setting in virtual reality for mice. In this context of goal-directed navigation based on sensory cues, we chose to center our study on posterior cortical association areas, PPC and RSC, for several reasons. RSC has been shown to be crucial for navigation across species; it is poised to enable the transformation between egocentric and allocentric reference frames and to support spatial memory across various timescales (Alexander & Nitz, 2015; Fischer et al., 2020; Pothuizen et al., 2009; Powell et al., 2017). It has furthermore been shown to be involved in cognitive processes beyond spatial navigation, such as temporal learning and value coding (Hattori et al., 2019; Todd et al., 2015), and is emerging as a crucial region for the flexible integration of sensory and internal signals (Stacho & Manahan-Vaughan, 2022). It is thus a prime candidate area for studying how cognitive experience may affect cortical involvement in goal-directed navigation.

      RSC is heavily interconnected with PPC, which is generally thought to convert sensory cues into actions (Freedman & Ibos, 2018) and has been shown to be important for navigation-based decision tasks (Harvey et al., 2012; Pinto et al., 2019). Because task components involving short-term memory have been suggested to make PPC necessary for a given task (Lyamzin & Benucci, 2019), we included such components in our complex tasks to maximize the likelihood of strong PPC involvement, against which to compare the simple task.

      One such task component is a delay period between the cue and the ultimate choice report, which is a common design in decision tasks (Goard et al., 2016; Harvey et al., 2012; Katz et al., 2016; Pinto et al., 2019). We agree with the reviewer that such a task would traditionally be referred to as a working-memory task. However, we refrain from using this terminology because it may lead readers to expect that mice solve the task with a working-memory-dependent strategy in its strictest and most traditional sense, that is, that mice show no overt behaviors indicative of the ultimate choice until the end of the delay period. If the ultimate choice is apparent earlier, mice may use what is sometimes referred to as an embodiment-based strategy, which some readers may see as precluding working memory. Indeed, in new analyses decoding choice from the mice’s running patterns, we show that mice already start running towards the side of the ultimate choice during the cue period (Figure 1—figure supplement 1). Crucially, regardless of these seemingly early choices, we found much larger performance decrements from inhibition in mice performing the delay task than in mice performing the simple task, along with lower overall task performance in the delay task, indicating that the insertion of a delay period increased subjective task difficulty. As traditional working-memory versus embodiment-based strategies are not the focus of our study and do not seem to inform the performance decrements from inhibition, we chose to label the task descriptively by its crucial task parameter rather than by the supposedly underlying cognitive process.

      For the switching task, we appreciate that the reviewer sees similarities to a two-armed bandit task. However, in a two-armed bandit task, rewards are typically delivered probabilistically, whereas in our task, cue and action values are constant within each of the two rule blocks, and only the rule, i.e. the cue-choice association, reverses across blocks. This is a crucial distinction because in our design, blocks of Rule A in the switching task are identical to the simple task, with fixed cue-choice associations and guaranteed reward delivery if the correct choice is made, allowing a fair comparison of cortical involvement across tasks.

      We have now heavily revised the introduction, results, and discussion sections of the manuscript to better explain the motivation for the tasks and the investigated brain areas. These revisions cover all the points mentioned in this response.

      Furthermore, we agree with the reviewer that the three tasks are qualitatively different and likely depend on at least partially dissociable circuits. We consider the large differences in cortical inhibition effects between the simple and the complex tasks as evidence for this notion. We also note that we performed task-specific optogenetic manipulations, presented in the Supplementary Material, to further understand the involvement of different areas in task-specific processes. In what is now Figure 1—figure supplement 4, we restricted inhibition in the delay task to either the cue period only or the delay period only, and found that PPC or RSC inhibition during either period caused larger performance drops than those observed in the simple task. We also performed epoch-specific inhibition of PPC in the switching task, specifically targeting reward and inter-trial-interval (ITI) periods following rule switches, in what is now Figure 1—figure supplement 5. With such PPC inhibition during the ITI, we observed no effect on performance recovery after rule switches and thus found PPC activity to be dispensable for rule updates.

      For the working-memory task, we do not know the duration of the delay, but this is critical information: by definition, performance in such a task is delay-dependent, and this is not explored in the paper.

      We thank the reviewer for pointing out the lack of information on delay duration and have now added this to the Methods section.

      We agree that in classical working-memory tasks where the delay duration is purely defined by the experimenter and varied throughout a session, performance is typically dependent on delay duration. However, in our delay task, the delay distance is kept constant, and thus the delay is not varied by the experimenter. Instead, the time spent in the delay period is determined by the mouse, and the only source of variability in the time spent in the delay period is minor differences in the mice’s running speeds across trials or sessions. Notably, the differences in time spent in the delay period were greatest between mice because some mice ran faster than others. Within a mouse, the time spent in the delay period was generally rather consistent due to relatively constant running speeds. Also, because the mouse had full control over the delay duration, it could very well speed up its running if it started to forget the cue and run more slowly if it was confident in its memory. Thus, because the delay duration was set by the mouse and not the experimenter, it is very challenging or impossible to interpret the meaning and impact of variations in the delay duration. Accordingly, we had no a priori reason to expect a relationship between task performance and delay duration once mice had become experts at the delay task. Indeed, we do not see such a relationship in our data (see plot here, n = 85 sessions across 7 mice). In order to test the effect of delay duration on behavioral performance, we would have to systematically change the length of the delay period in the maze, which we did not do and which would require an entirely new set of experiments.

      Also, the authors heavily rely on "decision-making" but I am genuinely wondering if this is at all needed to account for the behavior exhibited by mice in these tasks (it would be more accurate for the bandit task) as with the perspective developed by the authors, any task implies a "decision-making" component, so that alone is not very informative on the nature of the cognitive operations that mice must compute to solve the tasks. I think a more accurate terminology in line with the specific task considered should be employed to clarify this.

      We acknowledge that the previous emphasis on decision-making may have created expectations that we demonstrate effects that are specific to the ‘decision-making’ aspect of a decision task. As we do not isolate the decision-making process specifically, we have substantially revised our wording around the tasks and removed the emphasis on decision-making, including in the title. Rather than decision-making, we now highlight the navigational aspect of the tasks employed.

      The "switching"/bandit task is particularly interesting. However, because the authors only consider trials with the highest accuracy, I think they are missing a critical component of this task, which is the balance between exploiting current knowledge and exploring alternative options when the former strategy is no longer effective. Trials with poor performance thus provide essential feedback, which is a major driver of exploratory actions and a critical asset of the bandit task. There is ample literature documenting how these tasks assess the exploration/exploitation trade-off.

      We completely agree with the reviewer that the periods following rule switches are an essential part of the switching task and of high interest. Indeed, ongoing work in the lab is carefully quantifying the mice’s strategy in this task and exploring how mice use errors after switches to update their belief about the rule. In this project, however, a detailed quantification of switching task strategy seemed beyond the scope because our focus was on training history and not on the specifics of each task. While we agree with the reviewer about the interesting nature of the switching period, it would be too much for a single paper to investigate the detailed mechanisms of each task on top of what we already report for training history. Instead, we have now added quantifications of performance recovery after rule switches in Figure 1—figure supplement 2, showing that rule switches cause below-chance performance initially, followed by recovery within tens of trials.

      2) Training history vs learning sets vs behavioral flexibility:

      The authors consider "training history" as the unique angle to interpret the data. Because the experimental setup is the same throughout all experiments, I am wondering if animals are simply provided with a cognitive challenge assessing behavioral flexibility, given that they must identify the new rule while refraining from responding using previously established strategies. According to this view, cortical lesions may be expected to be more detrimental because multiple cognitive processes are now at play.

      It is also possible that animals form learning sets during successive learning episodes which may interfere with or facilitate subsequent learning. Little information is provided regarding learning dynamics in each task (e.g. trials to criterion depending on the number of tasks already presented) to have a clear view on that.

      We thank the reviewer for raising these interesting ideas. We have now evaluated these ideas in the context of our experimental design and results. One of the main points to consider is that for mice transitioned from either of the complex tasks to the simple task, the simple task is not a novel task, but rather a well-known simplification of the previous tasks. Mice that are experts on the delay task have experienced the simple task, i.e. trials without a delay period, during their training procedure before being exposed to delay periods. Switching task expert mice know the simple task as one rule of the switching task and have performed according to this rule in each session prior to the task transition. Accordingly, upon the transition to the simple task, both delay task expert mice and switching task expert mice perform at very high levels in the very first simple task session. We now quantify and report this in Figure 2—figure supplement 1 (A, B). This is crucial to keep in mind when assessing ‘learning sets’ or ‘behavioral flexibility’ as possible explanations for the persistent cortical involvement after the task transitions. In classical learning set paradigms, animals are exposed to a series of novel associations, and the learning of previous associations speeds up the learning of subsequent ones (Caglayan et al., 2021; Eichenbaum et al., 1986; Harlow, 1949). This paradigm is distinct from ours because the simple task does not contain associations that are new to the mice already trained on the complex tasks. Relatedly, the simple task is unlikely to present a challenge of behavioral flexibility to these mice given our experimental design and the observation of high simple task performance in the first session after the task transition.

      We now clarify these points in the introduction, results, and discussion sections, also acknowledging that it will be of interest for future work to investigate how learning sets may affect cortical task involvement.

      3) Calcium imaging data versus interventions:

      The value of the calcium imaging data is not entirely clear. Does this approach bring a new point to consider to interpret or conclude on behavioral data or is it to be considered convergent with the optogenetic interventions? Very specific portions of behavioral data are considered for these analyses (e.g. only highly successful trials for the switching/bandit task) and one may wonder if considering larger or different samples would bring similar insights. The whole take on noise correlation is difficult to apprehend because of the same possible interpretation issue, does this really reflect training history, or that a new rule now must be implemented or something else? I don't really get how this correlative approach can help to address this issue.

      We thank the reviewer for pointing out that the relationship between the inhibition dataset and calcium imaging dataset is not clear enough. We restricted analyses of inhibition and calcium imaging data in the switching task to the identical cue-choice associations as present in the simple task (i.e. Rule A trials of the switching task). We did this because we sought to make the fairest and most convincing comparison across tasks for both datasets. However, we can now see that not reporting results with trials from the other rule causes concerns that the reported differences across tasks may only hold for a specific subset of trials.

      We have now added analyses of optogenetic inhibition effects and calcium imaging results considering Rule B trials. In Figure 1—figure supplement 2, we show that when considering only Rule B trials in the switching task, effects of RSC or PPC inhibition on task performance are still increased relative to the ones observed in mice trained on and performing the simple task. We also show that overall task performance is lower in Rule B trials of the switching task than in the simple task, mirroring the differences across tasks when considering Rule A trials only.

      We extended the equivalent comparisons to the calcium imaging dataset, only considering Rule B trials of the switching task in Figure 4—figure supplement 3. With Rule B trials only, we still find larger mean activity and trial-type selectivity levels in RSC and PPC, but not in V1, compared to the simple task, as well as lower noise correlations. We thus find that our conclusions about area necessity and activity differences across tasks hold for Rule B trials and are not due to only considering a subset of the switching task data.
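      Noise correlations are mentioned here without a definition. One common definition, assumed for this sketch and not necessarily the exact procedure used in the study, is the Pearson correlation between two neurons’ trial-by-trial responses after subtracting each neuron’s mean response for the corresponding trial type, averaged over all neuron pairs. The function names and the input layout (one list of per-trial responses per neuron, plus one trial-type label per trial) are hypothetical.

```python
from itertools import combinations
from statistics import mean


def residuals(responses, trial_types):
    """Subtract each neuron's per-trial-type mean, leaving trial-to-trial fluctuations."""
    out = []
    for r in responses:
        type_means = {
            t: mean(x for x, tt in zip(r, trial_types) if tt == t)
            for t in set(trial_types)
        }
        out.append([x - type_means[tt] for x, tt in zip(r, trial_types)])
    return out


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def mean_noise_correlation(responses, trial_types):
    """Average pairwise noise correlation across all neuron pairs (hypothetical helper)."""
    res = residuals(responses, trial_types)
    return mean(pearson(res[i], res[j]) for i, j in combinations(range(len(res)), 2))
```

      For example, two neurons whose residual fluctuations rise and fall together on every trial yield a mean noise correlation of 1.0, even if their trial-type means differ, which is why trial-type means are removed first.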

      In Figure 4—figure supplement 4, we further leverage the inclusion of Rule B trials and present new analyses of different single-neuron selectivity categories across rules in the switching task, reporting a prevalence of mixed selectivity in our dataset.

      Furthermore, to clarify the link between the optogenetic inhibition and the calcium imaging datasets, we have revised the motivation for the imaging dataset, as well as the presentation of its results and discussion. Investigating an area’s neural activity patterns is a crucial first step towards understanding how differential necessity of an area across tasks or experience can be explained mechanistically on a circuit level. We now elaborate on the fact that mechanistically, changes in an area’s necessity may or may not be accompanied by changes in activity within that area, as previous work in related experimental paradigms has reported differences in necessity in the absence of differences in activity (Chowdhury & DeAngelis, 2008; Liu & Pack, 2017). This phenomenon can be explained by differences in the readout of an area’s activity. We now make more explicit that in contrast to the scenario where only the readout changes, we find an intriguing correspondence between increased necessity (as seen in the inhibition experiments) and increased activity and selectivity levels (as seen in the imaging experiments) in cortical association areas depending on the current task and previous experience. Rather than attributing the increase in necessity solely to these observed changes in activity, we highlight that in the simple task condition already, cortical areas contain a high amount of task information, ruling out the idea that insufficient local information would cause the small performance deficits from inhibition. Our results thus suggest that differential necessity across tasks and experience may still require changes at the readout level despite changes in local activity. 

      We view our imaging results as an exciting first step towards a mechanistic understanding of how cognitive experience affects cortical necessity, but we stress that future work will need to test directly the relationship between cortical necessity and various specific features of the neural code.

      Reviewer #2 (Public Review):

      The authors use a combination of optogenetics and calcium imaging to assess the contribution of cortical areas (posterior parietal cortex, retrosplenial cortex, S1/V1) to a visual-place discrimination task. Head-fixed mice were trained on a simple version of the task where they were required to turn left or right depending on the visual cue that was present (e.g. X = go left; Y = go right). In a more complex version of the task, the configurations were either switched during training or the stimuli were only presented at the beginning of the trial (delay).

      The authors found that inhibiting the posterior parietal cortex and retrosplenial cortex affected performance, particularly on the complex tasks. However, previous training on the complex tasks resulted in more pronounced impairments on the simple task than when behaviourally naïve animals were trained/tested on a simple task. This suggests that the more complex tasks recruit these cortical areas to a greater degree, potentially due to increased attention required during the tasks. When animals then perform the simple version of the task their previous experience of the complex tasks is transferred to the simple task resulting in a different pattern of impairments compared to that found in behaviorally naïve animals.

      The calcium imaging data showed a similar pattern of findings to the optogenetic study. There was overall increased activity in the switching tasks compared to the simple tasks, consistent with the greater task demands. There was also greater trial-type selectivity in the switching task compared to the simple task. This increased trial-type selectivity in the switching tasks was subsequently carried forward to the simple task, so that activity patterns were different when animals performed the simple task after experiencing the complex task compared to when they were trained on the simple task alone.

      Strengths:

      The use of optogenetics and calcium-imaging enables the authors to look at the requirement of these brain structures both in terms of necessity for the task when disrupted as well as their contribution when intact.

      The use of the same experimental set up and stimuli can provide a nice comparison across tasks and trials.

      The study nicely shows that the contribution of cortical regions varies with task demands and that longer-term changes in neuronal responses can transfer across tasks.

      The study highlights the importance of considering previous experience and exposure when understanding behavioural data and the contribution of different regions.

      The authors include a number of important controls that help with the interpretation of the findings.

      We thank the reviewer for pointing out these strengths in our work and for finding our main conclusions supported.

      Weaknesses:

      There are some experimental details that need to be clarified to help with understanding the paper in terms of behavior and the areas under investigation.

      The use of the same stimuli throughout is beneficial as it allows direct comparisons with animals experiencing the same visual cues. However, it does limit the extent to which you can extrapolate the findings. It is perhaps unsurprising to find that learning about specific visual cues affects subsequent learning and use of those specific cues. What would be interesting to know is how much of what is being shown is cue-specific learning, or whether it reflects something more general, for example schema learning, which could be generalised to other learning situations. If animals were then trained on a different discrimination with different stimuli, would this previous training modify behavior and neural activity in that instance? This would perhaps be more reflective of typical laboratory experiments where you may find an impairment on a more complex task and then go on to rule out more simple discrimination impairments. However, this would typically be done with slightly different stimuli so you don't introduce transfer effects.

      We agree with the reviewer that investigating the effects of schema learning on cortical task involvement is an exciting future direction and have now explicitly mentioned this in the Discussion section. As the reviewer points out, however, our study was not designed to test this idea specifically. Because investigating schema learning would require developing and implementing an entirely new set of behavioral task variants, we feel this is beyond the scope of the current work. As to the question of how generalized the effects of cognitive experience are, our data in the run-to-target task suggest that if task settings are sufficiently distinct, cortical involvement can be similarly low regardless of complex task experience (now Figure 3—figure supplement 1). This finding is in line with recent work (Pinto et al., 2019) in which cortical involvement appears to change rapidly depending on major differences in task demands. However, work in MT has shown that previous motion discrimination training using dots can alter MT involvement in motion discrimination of gratings (Liu & Pack, 2017), highlighting that cortical involvement need not be tightly linked to the sensory cue identity.

      It is not clear whether length of training has been taken into account for the calcium imaging study given the slow development of neural representations when animals acquire spatial tasks.

      We apologize that the training duration and the temporal relationship between task acquisition and calcium imaging were not documented for the calcium imaging dataset. Please see our detailed reply to Reviewer 2’s ‘recommendations for the authors’ below.

      The authors are presenting the study in terms of decision-making, however, it is unclear from the data as presented whether the findings specifically relate to decision making. I'm not sure the authors are demonstrating differential effects at specific decision points.

      We understand that the previous emphasis on decision-making may have created expectations that we demonstrate effects that are specific to the ‘decision-making’ aspect of a decision task. As we do not isolate the decision-making process specifically, we have substantially revised our wording around the tasks and removed the emphasis on decision-making, including in the title. Rather than decision-making, we now highlight the navigational aspect of the tasks employed.

      While we removed the emphasis on the decision-making process in our tasks, we found the reviewer’s suggestion to measure ‘decision points’ a useful additional behavioral characterization across tasks. So, we quantified how soon a mouse’s ultimate choice can be decoded from its running pattern as it progresses through the maze towards the Y-intersection. We now show these results in Figure 1—figure supplement 1. Interestingly, we found that in the delay task, choice decoding accuracy was already very high during the cue period before the onset of the delay. Nevertheless, we had shown that overall task performance and performance with inhibition were lower in the delay task compared to the simple task. Also, in segment-specific inhibition experiments, we had found that inhibition during only the delay period or only the cue period decreased task performance substantially more than in the simple task, thus finding an interesting absence of differential inhibition effects around decision points. Overall, how early a mouse made its ultimate decision did not appear predictive of the inhibition-induced task decrements, which we also directly quantify in Figure 1—figure supplement 1.

    1. Author Response

      Reviewer 2 (Public Review):

      1) The authors developed a novel C. elegans model for studying extracellular amyloid beta aggregation, and it is therefore likely to be taken up broadly by the field. However, the new model should be fully characterized. Throughout the manuscript, the only method used to detect amyloid deposition was GFP fluorescence intensity and morphology, while direct characterization of amyloid aggregates is lacking.

      We thank the reviewer for the feedback and for the foresight that this model might be taken up by the field. To strengthen our model, as the reviewer suggested, we confirmed that the GFP fluorescence indeed reflects amyloid aggregates. Please see point 3 above and the new Supporting Figure 1.1.

      2) A targeted RNA interference (RNAi) screen was used to identify the key regulators of Aβ aggregation and clearance, which is one of the strengths of the study. There should be evidence that RNAi works to knockdown the specific genes. Similarly, there should be evidence indicating that ADM-2 is indeed expressed in the overexpression experiments.

      We aimed to verify our main hits (cri-2 and adm-2) with mutations in these genes, as RNAi can have off-target effects. The adm-2(ok3178) allele is a 989 bp deletion causing a splice/acceptor change, which probably leads to a truncated, out-of-frame protein.

      Author response image 1.

      The cri-2(gk314) allele is a 1213 bp deletion covering the whole cri-2 locus, suggesting that it is a null allele.

      Author response image 2.

      For the overexpression, no ADM-2 antibody is available, and our attempt to generate one was unfortunately unsuccessful. We can therefore only infer ADM-2 overexpression from the induction and increased red fluorescence of ADM-2::mScarlet (Supporting Figure 6.1).

      3) It remains unknown whether ADM-2 directly degrades Aβ or facilitates the clearance of Aβ by remodeling the ECM. The effect of ADM-2 on ECM remodeling should be examined.

      We addressed this in point 1 above and also in our discussion section.

    1. Author Response

      Reviewer #1 (Public Review):

      In this paper, Bai et al. investigate in experiments and simulations how cohesion is maintained in chemotactic travelling waves of bacteria. These waves emerge from the bacterial population consuming an attractant, thus carving a gradient which they follow chemotactically. This paper builds on previous work by some of the authors (Fu et al, Nat Commun 2018), which found that in these waves bacteria with varying degrees of chemotactic sensitivity organize spatially in the band, which allows for its cohesiveness despite varying phenotypes. The authors investigate here an additional element for the cohesiveness of the wave: because the sharpness of the gradient increases from the front to the back of the wave, 'late' cells catch up via a stronger chemotactic response, and front cells slow down via a weaker one. This had already been postulated in earlier work on the phenomenon (Saragosti et al. PNAS 2011), but here the authors investigate how this applies to cells with varying chemotactic sensitivity. They also performed agent-based simulations of the cells' behavior in the gradient and developed a model of the motion in the gradient. The latter maps the spatial dependence of the gradient steepness onto an effective travelling potential which keeps the cells together in a group as the gradient and the wave propagate. Importantly, the effective potential is predicted to be tighter for cells with higher chemotactic sensitivity, in agreement with the cell behavior they observe in experiments where the chemotactic sensitivity is artificially modulated. This suggests that weakly chemotactic cells are more weakly bound to the group and have a higher chance of being left behind. This last part is interesting in the context of range extension in semi-solid agar, where bacteria are known to be spatially organized and selected according to their chemotactic motility (Ni et al, Cell Reports 2017, Liu et al Nature 2019).

      The strengths of this paper are the extensive experimental characterization of the system and the variety of modeling approaches, which together make a fairly convincing case for the proposed mechanism of cohesion maintenance.

      In fact, we have addressed both the mechanism that maintains a coherent group and the mechanism that forms an ordered pattern of diverse phenotypes. Thanks to the reviewer, we noticed that the second point was not clearly presented in our previous version. We have therefore substantially rewritten the text and reorganized the results to highlight both mechanisms.

      From a methodological perspective, only a few points need to be addressed:

      Control experiments need to quantify the cell-to-cell variability of the induction level of Tar by tetracycline.

      The distributions of induction levels in the titrated cells were measured using a ptet-Tar-GFP strain, in which GFP serves as a reporter of the expressed Tar protein. The results are shown below:

      Chemical attraction to cues released by other cells is a well-documented way to create cohesive large-scale structures in E. coli (Budrene & Berg Nature 1995, Park et al PNAS 2003, Jani et al Microbiology 2017, Laganenka et al Nat commun 2016). The cohesion of the wave has never been analyzed from this perspective, although it is a possible alternative explanation to the gradient shape. Since the authors' main claim concerns wave cohesion, they should provide evidence that such an explanation can be ruled out or considered secondary.

      We thank the reviewer for pointing out self-attractant secretion as a possible mechanism for maintaining a coherent group. We argue that this mechanism is not necessary for the chemotactic group to remain coherent, because the migrating group stays coherent in our agent-based simulations, which do not include this effect.

      Moreover, as suggested by the reviewer, we used a Tar-only strain, which does not sense any chemoattractant other than aspartate, and showed that the migrating group remained coherent (see Fig S9). This experiment shows that secretion of a self-attractant is not essential for coherent group migration.

      Possible effects of physical interactions between cells on the chemotactic response are not accounted for. The consequences should be better discussed, because they are known to influence chemotactic motility at the densities encountered in the present experiments (Colin et al Nat commun 2019).

      As reported by Colin et al., the effective drift velocity and the chemotactic ability decrease when cells are dense (volume fraction > 0.01). However, the cell density in our experiments is below this critical value (volume fraction < 0.01).

      Additionally, the paper could better emphasize the new results and separate them from the confirmations of previous results.

      In the revised version, we highlight two new findings:

      1) The individual drift velocity decreases from the back to the front of the migrating bacterial group, which makes the chemotactic migration wave a pushed wave.

      2) Cells of diverse phenotypes follow the same mean-reversion behavior, i.e., they drift faster at the back and slower at the front, but around ordered mean positions, which produces the ordered pattern in the migrating group.

      Reviewer #2 (Public Review):

      The manuscript by Bai et al. explores the single-cell motility dynamics within a chemotactic soliton wave in E. coli. They tracked individual cells and measured their trajectory speed and orientation distributions behind and ahead of the wave. They showed cells behind the wave were moving in a more directed fashion towards the center of the wave compared to cells ahead of the wave. This behavior explains the stability of group migration, as confirmed by numerical simulations.

      I do not recommend this manuscript for publication in eLife since it basically reproduces and deepens previously published works. In particular, Saragosti et al (2011) already provided exactly what the authors claim to do here: "How individuals with phenotypic and behavioral variations manage to maintain the consistent group performance and determine their relative positions in the group is still a mystery." (Line 75-77) (See the last sentences from Saragosti et al: "This modulation of the reorientations significantly improves the efficiency of the collective migration. Moreover, these two quantities are spatially modulated along the concentration profile. We recover quantitatively these microscopic and macroscopic observations with a dedicated kinetic model.")

      Saragosti et al. discuss the modulation of the reorientation angle of bacteria depending on their direction of motion. This is not equivalent to a spatial modulation of drift velocity. They report that cells moving up the gradient reorient less during a tumble than cells moving down the gradient, which increases the migration efficiency of the group. In our paper, by contrast, we show that the drift velocity of bacteria is spatially modulated: cells at the back drift faster, while cells at the front drift slower. This is important because it makes the chemotactic migration front a pushed wave, which helps the group maintain diverse phenotypes.

      Saragosti et al. also suggested a spatial modulation of the run-length bias to explain the coherence of the migrating group, but they did not quantify this bias, nor did they explain the causes and consequences of the spatial modulation. Moreover, their model, which incorporates their proposed mechanism of directional persistence, cannot explain their observed decrease in run-length bias (see their Figure 4A and C). We therefore cannot agree that they have already shown how cells with diverse phenotypes maintain a coherent group.

      Moreover, they did not address phenotypic diversity within the group.

      What is novel here is the titration of the behavior with chemo-receptor abundance, but I believe the scope is not wide enough for publication in eLife. I suggest the authors to submit in a more specialized journal.

      The titration of chemoreceptor abundance serves as a tool to explain how diverse individuals manage to form ordered patterns within a group. This question merits discussion because diversity is known to be important for the survival of a group. The ordered pattern has been found to be key for a migrating group to maintain its diversity while migrating at a consistent speed. In this paper, we explain how individuals performing biased random walks are able to form an ordered structure.

      Reviewer #3 (Public Review):

      The authors present a study on the collective behaviour of E. coli during migration in a self-generated gradient. Taking into account phenotypic variation within a biological population, they performed experiments and complemented the study with a predictive model used for simulation to understand how bacteria can move as a group and how the individual bacterium defines its own position within the group.

      They observed experimentally that phenotype variation within the bacterial population causes a spatial distribution within the chemotactic band that is not continuous but formed by subpopulations with specific properties such as run length, run duration, angular distribution of trajectories, drift velocity. They attribute this behaviour to the chemotaxis ability, which varies between phenotypes and defines a potential well that anchors each bacterium in its own group. This was proven by the subdiffusive dynamics of the bacteria in each subgroup. Many cases were studied in the experiments and the authors present many controls to clearly demonstrate their hypothesis.

      These are interesting results that prove how a discretised distribution can produce continuous collective behaviour. It presents also an interesting example in the field of active matter about collective behaviour on a large scale that is generated by a different behaviour of individuals on a much smaller scale. However, it is not clear how the subpopulations can be held together in the group.

      The chemoattractant gradient, whose steepness decreases from back to front, makes the migration wavefront a pushed wave. As a result, the balanced position of a subpopulation with higher chemotactic ability lies further toward the front, where the gradient is shallower. Diverse phenotypes therefore form an ordered pattern, each migrating at the same speed at its own balanced position. This discussion was added to the revised text (see lines 268-277).
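      The balance-position argument can be made concrete with a minimal sketch. It assumes, purely for illustration, that a cell's drift velocity equals its chemotactic sensitivity chi multiplied by the local gradient steepness g(x), and that g(x) = g0 * exp(-x / L) decays toward the front of the wave (larger x in the co-moving frame). The functions `balance_position` and `drift` and the parameters g0, L and the wave speed c are hypothetical, not quantities from the manuscript. A subpopulation sits where its drift equals the wave speed, chi * g(x*) = c, which gives x* = L * ln(chi * g0 / c), so a larger chi places the balance point further forward.

```python
import math


def balance_position(chi: float, g0: float = 1.0, length: float = 1.0,
                     wave_speed: float = 1.0) -> float:
    """Return x* where chi * g(x*) equals the wave speed.

    Hypothetical gradient steepness: g(x) = g0 * exp(-x / length),
    which decreases from the back (small x) to the front (large x).
    Requires chi * g0 > 0 and wave_speed > 0.
    """
    return length * math.log(chi * g0 / wave_speed)


def drift(chi: float, x: float, g0: float = 1.0, length: float = 1.0) -> float:
    """Drift velocity chi * g(x) under the same hypothetical gradient."""
    return chi * g0 * math.exp(-x / length)


if __name__ == "__main__":
    for chi in (1.5, 2.0, 4.0):
        x_star = balance_position(chi)
        # Behind x* the cell drifts faster than the wave; ahead of x* slower.
        print(f"chi={chi}: x*={x_star:.3f}, "
              f"drift behind={drift(chi, x_star - 0.5):.2f}, "
              f"drift ahead={drift(chi, x_star + 0.5):.2f}")
```

      The exponential profile is only a convenient choice: for any steepness profile that decreases monotonically toward the front, chi * g(x*) = c implies g(x*) = c / chi, so higher sensitivity always maps to a more forward balance position.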

      Moreover, a link between bacterial dynamics and the biological necessary mechanism is not clear.

      The dynamics of individual bacteria are controlled by the bacterial chemotaxis pathway, which is well characterized in previous studies. Briefly, the biased random walk arises from modulating the expected run length through a temporal comparison of the chemoattractant concentrations a cell experiences (Jiang et al. 2010 PLoS Comp. Biol.).

      They formulate a theoretical description based on the classical Keller-Segel model. Langevin dynamics was used to describe bacterial activity in terms of drift velocity for simulation, which agrees very well with experimental observations.

      One can appreciate the interesting results of the study describing E. coli chemotaxis as a mean-reversion process with an associated potential, but it is not clear to what extent the results can be generalised to all bacteria or rather relate to the strain the authors investigated.

      The mean-reversion process results from the drift velocity decreasing from back to front (i.e., a pushed wave). Although our study focuses on bacterial chemotactic migration, the ordering mechanism of diverse phenotypes follows an Ornstein-Uhlenbeck (OU)-type model, which is not limited to bacterial chemotaxis. We therefore argue that the ordering mechanism we propose is universal to all active particles that generate signals serving as a global cue for collective motion.
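      The OU-type mean reversion can likewise be sketched in the co-moving frame, assuming that near its balance position x* a cell's net velocity relative to the wave is approximately linear, -k (x - x*), plus diffusive noise. This gives dx = -k (x - x*) dt + sqrt(2D) dW, integrated below with the Euler-Maruyama scheme. The function `simulate_ou` and the values of k, D and x* are arbitrary illustrative choices, not fitted parameters from the study; the stationary distribution of this process has mean x* and variance D / k.

```python
import math
import random


def simulate_ou(x_star: float, k: float, diffusion: float, dt: float = 0.01,
                n_steps: int = 200_000, seed: int = 0) -> list[float]:
    """Euler-Maruyama integration of dx = -k (x - x_star) dt + sqrt(2 D) dW.

    x is the position in the frame moving with the wave; k is a hypothetical
    restoring rate and `diffusion` a hypothetical effective diffusivity.
    """
    rng = random.Random(seed)
    x = x_star
    noise_scale = math.sqrt(2.0 * diffusion * dt)
    samples = []
    for _ in range(n_steps):
        # Deterministic pull back toward x_star plus a Gaussian increment.
        x += -k * (x - x_star) * dt + noise_scale * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples


if __name__ == "__main__":
    samples = simulate_ou(x_star=2.0, k=1.0, diffusion=0.5)
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    # Compare with the stationary theory: mean = x_star, variance = D / k.
    print(f"mean={mean:.2f} var={var:.2f} (theory: 2.00, 0.50)")
```

      Running the same simulation with two different x* values gives two bounded clouds of positions that stay ordered in the wave frame, which is the qualitative behavior the OU description is meant to capture.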

    1. Author Response

      Reviewer #2 (Public Review):

      The time-dependency of the model simulations was not analyzed, and the nature of the observed biphasic time-dependent APAP response remains elusive. It would be interesting to see how the model can explain the time course of the APAP stimulation experiment.

      In its current state, the alternative model can only describe steady-state conditions. We understand that the reviewer is interested in the dynamic behavior of the model. Nevertheless, our approach provides a proof of principle that the alternative model can phenomenologically explain the changes in YAP localization in response to APAP treatment. Modeling the Hippo pathway in a time-dependent manner in response to APAP treatment is very challenging and would require further investigation and, most notably, further development of the PDE simulation algorithms and the SME software. Such a technical update of the software algorithms is beyond the scope of this manuscript.

      Nevertheless, we decided to share our first, preliminary analyses of the dynamic processes caused by APAP with the reviewer. For this, we simulated the steady-state model with arbitrarily chosen changes, in which APAP increases (early time point) and decreases (late time points) YAP phosphorylation in the nucleus (see Figure below).

      The simulated alternative model shows that an increase in YAP phosphorylation of about 50% leads to cytoplasmic localization of YAP (Rebuttal Figure R5A/B). However, this shuttling is not detectable in our protein fractionation and live-cell imaging experiments (see also Rebuttal Figure R7C/D). At late time points, a decrease in YAP phosphorylation (about 60%) led to a clear nuclear enrichment in the model, consistent with the nuclear enrichment and dephosphorylation of YAP observed in our experiments. Thus, our mathematical model describes well the Hippo pathway dynamics observed at later stages after APAP treatment (nuclear enrichment). However, the early events cannot be fully explained (the predicted nuclear exclusion of YAP is not detectable).

      We suggest two explanations for this observation. First, other molecular mechanisms (not yet identified and therefore not part of the model topology) may oppose the cytoplasmic YAP enrichment (nuclear exclusion) expected at early time points. Second, the detection methods used in this study (Western blotting and live-cell imaging) cannot capture minimal changes and cellular heterogeneity in the chosen experimental setup. We clarify this limitation of our study in the Discussion section of the manuscript (page 12, lines 436-440).

      Time-dependency of YAP (orange) localization based on the simulated APAP treatment. (A): Simulated control (ctrl) and APAP treatment for 2 and 48h. The treatment was simulated by changing the phosphorylation coefficient of YAP in the nucleus. (B): Simulated pYAP/YAP ratio during control and APAP treatment for 2 and 48 hours at the steady state of the model. (C): Simulated NCR of the total YAP during control and APAP treatment for 2 and 48 hours at the steady state.

    1. Author Response

      Reviewer #1 (Public Review):

      Because of the importance of brain and cognitive traits in human evolution, brain morphology and neural phenotypes have been the subject of considerable attention. However, work on the molecular basis of brain evolution has tended to focus on only a handful of species (i.e., human, chimp, rhesus macaque, mouse), whereas work that adopts a phylogenetic comparative approach (e.g., to identify the ecological correlates of brain evolution) has not been concerned with molecular mechanism. In this study, Kliesmete, Wange, and colleagues attempt to bridge this gap by studying protein and cis-regulatory element evolution for the gene TRNP1, across up to 45 mammals. They provide evidence that TRNP1 protein evolution rates and its ability to drive neural stem cell proliferation are correlated with brain size and/or cortical folding in mammals, and that activity of one TRNP1 cis-regulatory element may also predict cortical folding.

      There is a lot to like about this manuscript. Its broad evolutionary scope represents an important advance over the narrower comparisons that dominate the literature on the genetics of primate brain evolution. The integration of molecular evolution with experimental tests for function is also a strength. For example, showing that TRNP1 from five different mammals drives differences in neural stem cell proliferation, which in turn correlate with brain size and cortical folding, is a very nice result. At the same time, the paper is a good reminder of the difficulty of conclusively linking macroevolutionary patterns of trait evolution to molecular function. While TRNP1 is a moderate outlier in the correlation between rate of protein evolution and brain morphology compared to 125 other genes, this result is likely sensitive to how the comparison set is chosen; additionally, it's not clear that a correlation with evolutionary rate is what should be expected. Further, while the authors show that changes in TRNP1 sequence have functional consequences, they cannot show that these changes are directly responsible for size or folding differences, or that positive selection on TRNP1 is because of selection on brain morphology (high bars to clear). Nevertheless, their findings contribute strong evidence that TRNP1 is an interesting candidate gene for studying brain evolution. They also provide a model for how functional follow-up can enrich sequence-based comparative analysis.

      We thank the reviewer for the positive assessment. With respect to our set of control genes and the interpretation of the correlation between the evolution of the TRNP1 protein sequence and the evolution of brain size and gyrification, we would like to note the following. We agree that the set is small, but it includes all similarly sized genes with one coding exon that we could find in all 30 species. Furthermore, the control genes are comparable to TRNP1 with respect to alignment quality and average omega (Figure 1-figure supplement 3). Hence, we think that the selection procedure and the actual omega distribution make them a valid, unbiased set against which TRNP1’s co-evolution with brain phenotypes can be compared. Moreover, we want to point out that by using Coevol we correlate evolutionary rates, that is, the rate of TRNP1 protein evolution as measured by omega and the rate of brain size evolution, which Coevol models as a Brownian motion process. We think that this was unclear in the previous version of our manuscript, and we appreciate that the reviewer saw some merit in our analyses in spite of it.

      Finding conclusive evidence to link molecular evolution to concrete phenotypes is indeed difficult and necessarily inferential. This said, we still believe that correlating rates of evolution of phenotype and sequence across a phylogeny is one of the most convincing pieces of evidence available.

      Reviewer #2 (Public Review):

      In this paper, Kliesmete et al. analyze the protein and regulatory evolution of TRNP1, linking it to the evolution of brain size in mammals. We feel that this is very interesting and the conclusions are generally supported, with one concern.

      The comparison of dN/dS (omega) values to 125 control proteins is helpful, but an important factor was not controlled. The fraction of a protein in an intrinsically disordered region (IDR) is potentially even more important in affecting dN/dS than the protein length or number of exons. We suggest comparing dN/dS of TRNP1 to another control set, preferably at least ~500 proteins, which have similar % IDR.

      Thank you for this interesting suggestion. As mentioned in the public response to Reviewer #1, we are sorry that we did not explain the rationale of our approach well in the previous version of the manuscript. As argued above, we think that our control proteins are an unbiased set, as they have comparable alignment quality and an average omega (dN/dS) similar to that of TRNP1 (Figure 1-figure supplement 3). While IDRs tend to have a higher omega than the non-disordered parts of their proteins, we do not think that IDR content is more relevant than omega itself, because we do not interpret this estimate on its own but rather its covariance with the rate of phenotypic change. Indeed, the proteins in our control set with a higher IDR content (D2P2, Oates et al. 2013) do not show stronger evidence of co-evolving with the brain phenotypes (IDR content vs. absolute brain size-omega partial correlation: Kendall's tau = 0.048, p-value = 0.45; IDR content vs. absolute GI-omega partial correlation: Kendall’s tau = -0.025, p-value = 0.68; 88 proteins (71%) contain >0% IDRs; 8 proteins contain >62% IDRs, the IDR content of TRNP1).

      Reviewer #3 (Public Review):

      In this work, Z. Kliesmete, L. Wange and colleagues investigate TRNP1 as a gene of potential interest for the evolution of the mammalian cortex. Previous evidence suggests that TRNP1 is involved in self-renewal, proliferation and expansion in cortical cells in mouse and ferret, making this gene a good candidate for evolutionary investigation. The authors designed an experimental scheme to test two non-exclusive hypotheses: first, that evolution of the TRNP1 protein is involved in the emergence of larger and more convoluted brains; and second, that regulation of the TRNP1 gene also plays a role in this process alongside protein evolution.

      The authors report that the rate of TRNP1 protein evolution is strongly correlated to brain size and gyrification, with species with larger and more convoluted brains having more divergent sequences at this gene locus. The correlation with body mass was not as strong, suggesting a functional link between TRNP1 and brain evolution. The authors directly tested the effects of sequence changes by transfecting the TRNP1 sequences from 5 different species in mouse neural stem cells and quantifying cell proliferation. They show that both human and dolphin sequences induce higher proliferation, consistent with larger brain sizes and gyrifications in these two species. Then, the authors identified six potential cis-regulatory elements around the TRNP1 gene that are active in human fetal brain, and that may be involved in its regulation. To investigate whether sequence evolution at these sites results in changes in TRNP1 expression, the authors performed a massively parallel reporter assay using sequences from 75 mammals at these six loci. The authors report that one of the cis-regulatory elements drives reporter expression levels that are somewhat correlated to gyrification in catarrhine monkeys. Consistent with the activity of this cis-regulatory sequence in the fetal brain, the authors report that this element contains binding sites for TFs active in brain development, and contains stronger binding sites for CTCF in catarrhine monkeys than in other species. However, the specificity or functional relevance of this signal is unclear.

      Altogether, this is an interesting study that combines evolutionary analysis and molecular validation in cell cultures using a variety of well-designed assays. The main conclusions - that TRNP1 is likely involved in brain evolution in mammals - are mostly well supported, although the involvement of gene regulation in this process remains inconclusive.

      Strengths:

      • The authors have done a good deal of resequencing and data polishing to ensure that they obtained high-quality sequences for the TRNP1 gene in each species, which enabled a higher confidence investigation of this locus.

      • The statistical design is generally well done and appears robust.

      • The combination of evolutionary analysis and in vivo validation in neural precursor cells is interesting and powerful, and goes beyond the majority of studies in the field. I also appreciated that the authors investigated both protein and regulatory evolution at this locus in significant detail, including performing a MPRA assay across species, which is an interesting strategy in this context.

      Weaknesses:

      • The authors report that TRNP1 evolves under positive selection, however this seems to be the case for many of the control proteins as well, which suggests that the signal is non-specific and possibly due to misspecifications in the model.

      • The evidence for a higher regulatory activity of the intronic cis-regulatory element highlighted by the authors is fairly weak: the correlation across species is only 0.07, consistent with the rapid evolution of enhancers in mammals, and the correlation in catarrhine monkeys seems to be driven by a couple of outlier datapoints across the 10 species. It is unclear whether false discovery rates were controlled for in this analysis.

      • The analysis of the regulatory content in this putative enhancer provides some tangential evidence but no reliable conclusions regarding the involvement of regulatory changes at this locus in brain evolution.

      We thank the reviewer for the detailed comments. Indeed, TRNP1 overall has a rather average omega value across the tree, and hence the proportion of sites under selection is not greatly increased compared to the control proteins. This is desirable, because we want comparable power to detect a correlation between the rate of protein evolution (omega) and the rate of brain size or GI evolution for TRNP1 and the control proteins. What makes TRNP1 special is the rather strong correlation between the rate of brain size change and omega, which was exceeded by only 4% of our control proteins. Hence, we do not agree that model misspecification is a weakness of our analysis of TRNP1 protein evolution.

      We agree that the correlation of the activity induced by the intronic cis-regulatory element (CRE) with gyrification is weak, but we dispute that the correlation is due to outliers (see residual plot below) or to violations of model assumptions (see the new permutation analysis in the Results section). There are many reasons why we would expect such a correlation to be weak, including that an MPRA takes the CRE out of its natural genomic context. Our conclusions do not rest solely on these statistics, but also on independent corroborating evidence: Reilly et al. (2015) found a difference in the activity of the TRNP1 intron between human and macaque samples during brain development. Furthermore, we used their data and other public data to show that the intronic CRE is indeed active in humans and bound by CTCF (new Figure 4 - figure supplement 2).
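      A permutation test of a cross-species correlation can be sketched generically: shuffle one variable across species many times, recompute the correlation each time, and report how often the shuffled correlation is at least as extreme as the observed one. This is an illustration under stated assumptions, not the permutation analysis reported in the Results section; the functions `pearson` and `permutation_p_value` are hypothetical, the toy values are invented placeholders rather than study data, and plain label shuffling ignores phylogenetic non-independence among species, which a comparative analysis would need to address.

```python
import random
import statistics


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(x: list[float], y: list[float], n_perm: int = 10_000,
                        seed: int = 0) -> float:
    """Two-sided p-value: share of shuffles of y with |r| >= observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # breaks any pairing between x and y
        if abs(pearson(x, y_perm)) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical toy values, not data from the study.
    activity = [0.2, 0.5, 0.1, 0.9, 0.4, 0.7, 0.3, 0.8, 0.6, 1.0]
    gyrification = [1.1, 1.6, 1.0, 2.3, 1.2, 1.9, 1.4, 2.0, 1.5, 2.4]
    r = pearson(activity, gyrification)
    p = permutation_p_value(activity, gyrification)
    print(f"r = {r:.2f}, permutation p = {p:.4f}")
```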

      We believe that the combined evidence suggests a likely role for the intron CRE for the co-evolution of TRNP1 with gyrification.

    1. Author Response

      Reviewer #1 (Public Review):

      Trudel and colleagues aimed to uncover the neural mechanisms of estimating the reliability of the information from social agents and non-social objects. By combining functional MRI with a behavioural experiment and computational modelling, they demonstrated that learning from social sources is more accurate and robust compared with that from non-social sources. Furthermore, dmPFC and pTPJ were found to track the estimated reliability of the social agents (as opposed to the non-social objects). The strength of this study is to devise a task consisting of the two experimental conditions that were matched in their statistical properties and only differed in their framing (social vs. non-social). The novel experimental task allows researchers to directly compare the learning from social and non-social sources, which is a prominent contribution of the present study to social decision neuroscience.

      Thank you so much for your positive feedback about our work. We are delighted that you found that our manuscript provided a prominent contribution to social decision neuroscience. We really appreciate your time to review our work and your valuable comments that have significantly helped us to improve our manuscript further.

      One of the major weaknesses is the lack of a clear description about the conceptual novelty. Learning about the reliability/expertise of social and non-social agents has been of considerable concern in social neuroscience (e.g., Boorman et al., Neuron 2013; and Wittmann et al., Neuron 2016). The authors could do a better job in clarifying the novelty of the study beyond the previous literature.

      We understand the reviewer’s comment and have made changes to the manuscript that, first, highlight more strongly the novelty of the current study. Crucially, second, we have also supplemented the data analyses with a new model-based analysis of the differences in behaviour in the social and non-social conditions which we hope makes clearer, at a theoretical level, why participants behave differently in the two conditions.

      There has long been interest in investigating whether ‘social’ cognitive processes are special or unique compared to ‘non-social’ cognitive processes and, if they are, what makes them so. Differences between conditions could arise during the input stage (e.g. the type of visual input that is processed by social and non-social systems), at the algorithm stage (e.g. the type of computational principles that underpin social versus non-social processes) or, even if identical algorithms are used, social and non-social processes might depend on distinct anatomical brain areas or neurons within brain areas. Here, we conducted multiple analyses (in figures 2, 3, and 4 in the revised manuscript and in Figure 2 – figure supplement 1, Figure 3 – figure supplement 1, Figure 4 – figure supplement 3, Figure 4 – figure supplement 4) that not only demonstrated basic similarities in mechanism generalised across social and non-social contexts, but also demonstrated important quantitative differences that were linked to activity in specific brain regions associated with the social condition. The additional analyses (Figure 4 – figure supplement 3, Figure 4 – figure supplement 4) show that the differences are not simply a consequence of differences in the visual stimuli that are inputs to the two systems 1, nor do they reflect a difference in the type of algorithm used in each condition. Instead, our results suggest that the precise manner in which an algorithm is implemented differs when learning about social or non-social information and that this is linked to differences in neuroanatomical substrates.

      The previous studies mentioned by the reviewer are, indeed, relevant ones and were, of course, part of the inspiration for the current study. However, there are crucial differences between them and the current study. In the case of the previous studies by Wittmann, the aim was a very different one: to understand how one’s own beliefs, for example about one’s performance, and beliefs about others, for example about their performance levels, are combined. Here, by contrast, we were interested in the similarities and differences between social and non-social learning. It is true that the question resembles the one addressed by Boorman and colleagues in 2013, who looked at how people learned about the advice offered by people or computer algorithms, but the difference in the framing of that study perhaps contributed to the authors’ finding of little difference in learning. In the present study, we found evidence that people were predisposed to perceive stability in social performance and to be uncertain about non-social performance. By accumulating evidence across multiple analyses, we show that there are quantitative differences in how we learn about social versus non-social information, and that these differences can be linked to the way in which learning algorithms are implemented neurally. We therefore contend that our findings extend our previous understanding of how, in relation to other learning processes, ‘social’ learning has both shared and special features.

      We would like to emphasize the way in which we have extended several of the analyses throughout the revision. The theoretical Bayesian framework has made it possible to simulate key differences in behaviour between the social and non-social conditions. We explain in our point-by-point reply below how we have integrated a substantial number of new analyses. We have also more carefully related our findings to previous studies in the Introduction and Discussion.

      Introduction, page 4:

      [...] Therefore, by comparing information sampling from social versus non-social sources, we address a long-standing question in cognitive neuroscience: the degree to which any neural process is specialized for, or particularly linked to, social as opposed to non-social cognition 2–9. Given their similarities, it is expected that both types of learning will depend on common neural mechanisms. However, given the importance and ubiquity of social learning, it may also be that the neural mechanisms that support learning from social advice are at least partially specialized and distinct from those concerned with learning that is guided by non-social sources. It is less clear, though, at which level information is processed differently when it has a social or non-social origin. It has recently been argued that differences between social and non-social learning can be investigated on different levels of Marr’s information processing theory: differences could emerge at an input level (in terms of the stimuli that might drive social and non-social learning), at an algorithmic level or at a neural implementation level 7. It might be that, at the algorithmic level, associative learning mechanisms are similar across social and non-social learning 1. Other theories have argued that differences might emerge because goal-directed actions are attributed to social agents, which allows for very different inferences to be made about hidden traits or beliefs 10. Such inferences might fundamentally alter learning about social agents compared to non-social cues.

      Discussion, page 15:

      […] One potential explanation for the assumption of stable performance for social but not non-social predictors might be that participants attribute intentions and motivations to social agents. Even if the social and non-social evidence are the same, the belief that a social actor might have a goal may affect the inferences made from the same piece of information 10. Social advisors first learnt about the target’s distribution and accordingly gave advice on where to find the target. If the social agents are credited with goal-directed behaviour then it might be assumed that the goals remain relatively constant; this might lead participants to assume stability in the performances of social advisors. However, such goal-directed intentions might not be attributed to non-social cues, thereby making judgments inherently more uncertain and changeable across time. Such an account, focussing on differences in attribution in social settings, aligns with a recent suggestion that any attempt to identify similarities or differences between social and non-social processes can occur at any one of several levels of Marr’s information processing theory 7. Here we found that the same algorithm was able to explain social and non-social learning (a qualitatively similar computational model could explain both). However, the extent to which the algorithm was recruited when learning about social compared to non-social information differed. We observed a greater impact of uncertainty on judgments about non-social compared to social information. We have shown evidence for a degree of specialization when assessing social advisors as opposed to non-social cues. At the neural level we focused on two brain areas, dmPFC and pTPJ, that have not only been shown to carry signals associated with belief inferences about others but, in addition, recent combined fMRI-TMS studies have demonstrated the causal importance of these activity patterns for the inference process […]

      Another weakness is the lack of justifications of the behavioural data analyses. It is difficult for me to understand why 'performance matching' is suitable for an index of learning accuracy. I understand the optimal participant would adjust the interval size with respect to the estimated reliability of the advisor (i.e., angular error); however, I am wondering if the optimal strategy for participants is to exactly match the interval size with the angular error. Furthermore, the definitions of 'confidence adjustment across trials' and 'learning index' look arbitrary.

      First, having read the reviewer’s comments, we realise that our choice of the term ‘performance matching’ may not have been ideal, as it might not be the case that the participant intended to directly match their interval sizes with their estimates of advisor/predictor error. Like the reviewer, our assumption is simply that the interval sizes should change as the estimated reliability of the advisor changes and, therefore, that the intervals that the participants set should provide information about the estimates that they hold and the manner in which they evolve. On re-reading the manuscript we realised that we had used the term ‘performance matching’ in only a few places, and not consistently. In the revised manuscript we have simply removed it altogether and referred to the participants’ ‘interval setting’.

      Most of the initial analyses in Figure 2a-c aim to better understand the raw behaviour before applying any computational model to the data. We were interested in how participants make confidence judgments (decision-making per se), but also in how they adapt their decisions with additional information (changes or learning in decision making). In the revised manuscript we have made clear that these are used as simple behavioural measures and that they are complemented later by analyses derived from more formal computational models.

      In what we now refer to as the ‘interval setting’ analysis (Figure 2a), we tested whether participants select their interval settings differently in the social compared to non-social condition. We observe that participants set their intervals closer to the true angular error of the advisor/predictor in the social compared to the non-social condition. This observation could arise in two ways. First, it could be due to quantitative differences in learning despite general, qualitative similarity: mechanisms are similar but participants differ quantitatively in the way that they learn about non-social information and social information. Second, it could, however, reflect fundamentally different strategies. We tested basic performance differences by comparing the mean reward between conditions. There was no difference in reward between conditions (mean reward: paired t-test social vs. non-social, t(23)= 0.8, p=0.4, 95% CI= [-0.007 0.016]), suggesting that interval setting differences might not simply reflect better or worse performance in social or non-social contexts but instead might reflect quantitative differences in the processes guiding interval setting in the two cases.
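      The reward comparison above is a paired t-test: each of the 24 participants (hence t(23)) contributes one mean reward per condition, and the test is run on the within-participant differences. A minimal, standard-library Python sketch of that computation follows. The function name is our own, and the two-sided critical value is passed in rather than computed (for df = 23 it is approximately 2.069 at the 95% level), so this is an illustration, not the analysis code used in the study.

```python
import math
import statistics


def paired_t(a, b, t_crit):
    """Paired t-test on matched samples a and b (one value per participant).

    Returns (t, df, (ci_low, ci_high)), where the confidence interval is for the
    mean difference a - b and uses the supplied two-sided critical value t_crit.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two matched samples with at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)  # standard error of the mean difference
    t = mean_d / se
    half_width = t_crit * se
    return t, n - 1, (mean_d - half_width, mean_d + half_width)
```

      For instance, paired_t(social_rewards, nonsocial_rewards, t_crit=2.069), with two hypothetical per-participant lists, would return the t statistic, the degrees of freedom and a 95% CI for the mean social minus non-social difference.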

      In the next set of analyses, in which we compared raw data, applied a computational model, and provided a theoretical account for the differences between conditions, we suggest that there are simple quantitative differences in how information is processed in social and non-social conditions but that these have the important impact of making long-term representations (representations built up over a longer series of trials) more important in the social condition. This, in turn, has implications for the neural activity patterns associated with social and non-social learning. We therefore agree with the reviewer that one manner of interval setting is not necessarily more optimal than another. However, the differences that do exist in behaviour are important because they reveal something about social and non-social learning and their neural substrates. We have adjusted the wording and interpretation in the revised manuscript.

      Next, we analysed interval setting with two additional, related analyses: interval setting adjustment across trials and derivation of a learning index. We tested the degree to which participants adjusted their interval setting across trials and according to the prediction error (learning index, Figure 2f); the latter analysis is very similar to a trial-wise learning rate calculated in previous studies11. In contrast to many other studies, the intervals set by participants provide information about the estimates that they hold in a simple and direct way and enable calculation of a trial-wise learning index; we therefore decided to call it a ‘learning index’ instead of a ‘learning rate’, as it is not estimated via a model applied to the data but is instead calculated directly from the data. Arguably the directness of the approach, and its lack of dependence on a specific computational model, is a strength of the analysis.
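      The reply does not spell out the learning-index formula, so the following Python sketch assumes a delta-rule reading (our assumption): between consecutive encounters with the same advisor or predictor, the interval moves a fraction α of the way toward the observed angular error, I(t+1) = I(t) + α·(e(t) − I(t)), and α is solved for directly from the data, with no model fitting. The function learning_indices and its handling of zero prediction errors are hypothetical.

```python
def learning_indices(intervals, errors, eps=1e-9):
    """Hypothetical trial-wise learning index for one advisor/predictor.

    intervals[t]: interval the participant set on encounter t (degrees).
    errors[t]:    angular error of the advice observed on encounter t (degrees).
    Assumes the delta rule I[t+1] = I[t] + alpha * (errors[t] - I[t]) and solves
    for alpha on each pair of consecutive encounters. Where the prediction error
    is ~0, alpha is undefined and None is returned for that encounter.
    """
    if len(intervals) != len(errors):
        raise ValueError("intervals and errors must have the same length")
    out = []
    for t in range(len(intervals) - 1):
        pe = errors[t] - intervals[t]  # prediction error on encounter t
        if abs(pe) < eps:
            out.append(None)
        else:
            out.append((intervals[t + 1] - intervals[t]) / pe)
    return out
```

      With intervals of 20, 25 and 24 degrees and observed errors of 30, 20 and 22 degrees, the sketch returns indices of 0.5 and 0.2: half of the first prediction error and a fifth of the second are carried into the next interval.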

      Subsequently in the manuscript, a new analysis (illustrated in new Figure 3) employs Bayesian models that can simulate the differences in the social and non-social conditions and demonstrate that a number of behavioural observations can arise simply as a result of differences in noise in each trial-wise Bayesian update (Figure 3 and specifically 3d; Figure 3 – figure supplement 1b-c). In summary, the descriptive analyses in Figure 2a-c aid an intuitive understanding of the differences in behaviour in the social and non-social conditions. We have then repeated these analyses with Bayesian models incorporating different noise levels and showed that in such a way, the differences in behaviour between social and non-social conditions can be mimicked (please see next section and manuscript for details).

      We adjusted the wording in a number of sections in the revised manuscript, for example in Figure 2 and Figure 4 (figures and legends).

      Main text, page 5:

      The confidence interval could be changed continuously to make it wider or narrower, by pressing buttons repeatedly (one button press resulted in a change of one step in the confidence interval). In this way participants provided what we refer to as an ‘interval setting’.

      We also adjusted the following section in Main text, page 6:

      Confidence in the performance of social and non-social advisors

      We compared trial-by-trial interval setting in relation to the social and non-social advisors/predictors. When setting the interval, the participant’s aim was to minimize it while ensuring it still encompassed the final target position; points were won when it encompassed the target position but were greater when it was narrower. A given participant’s interval setting should, therefore, change in proportion to the participant’s expectations about the predictor’s angular error and their uncertainty about those expectations. Even though, on average, social and non-social sources did not differ in the precision with which they predicted the target (Figure 2 – figure supplement 1), participants gave interval settings that differed in their relationships to the true performances of the social advisors compared to the non-social predictors. The interval setting was closer to the angular error in the social compared to the non-social sessions (Figure 2a, paired t-test: social vs. non-social, t(23)= -2.57, p= 0.017, 95% confidence interval (CI)= [-0.36 -0.4]). Differences in interval setting might be due to generally lower performance in the non-social compared to social condition, or potentially due to fundamentally different learning processes utilised in either condition. We compared the mean reward amounts obtained by participants in the social and non-social conditions to determine whether there were overall performance differences. There was, however, no difference in the reward received by participants in the two conditions (mean reward: paired t-test social vs. non-social, t(23)= 0.8, p=0.4, 95% CI= [-0.007 0.016]), suggesting that interval setting differences might not simply reflect better or worse performance.

      Discussion, page 14:

      Here, participants did not match their confidence to the likely accuracy of their own performance, but instead to the performance of another social or non-social advisor. Participants used different strategies when setting intervals to express their confidence in the performances of social advisors as opposed to non-social advisors. A possible explanation might be that participants have a better insight into the abilities of social cues – typically other agents – than non-social cues – typically inanimate objects.

      As the authors assumed simple Bayesian learning for the estimation of reliability in this study, the degree/speed of the learning should be examined with reference to the distance between the posterior and prior belief in the optimal Bayesian inference.

      We thank the reviewer for this suggestion. We agree with the reviewer that further analyses that aim to disentangle the underlying mechanisms that might differ between both social and non-social conditions might provide additional theoretical contributions. We show additional model simulations and analyses that aim to disentangle the differences in more detail. These new results allowed clearer interpretations to be made.

      In the current study, we showed that judgments made about non-social predictors were changed more strongly as a function of the subjective uncertainty: participants set a larger interval, indicating lower confidence, when they were more uncertain about the non-social cue’s accuracy to predict the target. In response to the reviewer’s comments, the new analyses were aimed at understanding under which conditions such a negative uncertainty effect might emerge.

      Prior expectations of performance

      First, we compared whether participants had different prior expectations in the social condition compared to the non-social condition. One way to compare prior expectations is by comparing the first interval set for each advisor/predictor. This is a direct readout of the initial prior expectation with which participants approach our two conditions. In such a way, we test whether the prior beliefs before observing any social or non-social information differ between conditions. Even though this does not test the impact of prior expectations on subsequent belief updates, it does test whether participants have generally different expectations about the performance of social advisors or non-social predictors. There was no difference in this measure between social or non-social cues (Figure below; paired t-test social vs. non-social, t(23)= 0.01, p=0.98, 95% CI= [-0.067 0.68]).

      Figure. Confidence interval for the first encounter of each predictor in social and non-social conditions. There was no initial bias in predicting the performance of social or non-social predictors.

      Learning across time

      We have now seen that participants do not have an initial bias when predicting performances in social or non-social conditions. This suggests that differences between conditions might emerge across time when encountering predictors multiple times. We tested whether inherent differences in how beliefs are updated according to new observations might result in different impacts of uncertainty on interval setting between social and non-social conditions. More specifically, we tested whether the integration of new evidence differed between social and non-social conditions; for example, recent observations might be weighted more strongly for non-social cues while past observations might be weighted more strongly for social cues. This approach was inspired by the reviewer’s comments about potential differences in the speed of learning as well as the reduction of uncertainty with increasing predictor encounters. Similar ideas were tested in previous studies, when comparing the learning rate (i.e. the speed of learning) in environments of different volatilities 12,13. In these studies, a smaller learning rate was prevalent in stable environments, in which reward rates change more slowly over time, while higher learning rates often reflect learning in volatile environments, in which recent observations have a stronger impact on behaviour. Even though most studies derived these learning rates with reinforcement learning models, similar ideas can be translated into a Bayesian model. For example, an established way of changing the speed of learning in a Bayesian model is to introduce noise during the update process14. This noise is equivalent to adding in some of the initial prior distribution, which makes the Bayesian updates more flexible in adapting to changing environments: it widens the belief distribution and thereby makes it more uncertain.

      Recent information carries more weight in the belief update within a Bayesian model when beliefs are uncertain. This increases the speed of learning. In other words, a wide distribution (after adding noise) allows for quick integration of new information. By contrast, a narrow distribution does not integrate new observations as strongly and instead relies more heavily on previous information; this corresponds to a small learning rate. So, we would expect a steep decline of uncertainty to be related to a smaller learning index, while a slower decline of uncertainty is related to a larger learning index. We hypothesized that participants reduce their uncertainty more quickly when observing social information, thereby anchoring more strongly on previous beliefs instead of integrating new observations flexibly. Conversely, we hypothesized a less steep decline of uncertainty when observing non-social information, indicating that new information can be flexibly integrated during the belief update (new Figure 3a).
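      The link between belief width and learning speed can be checked in the simplest conjugate case. The Python sketch below swaps in a Gaussian belief with known observation noise, which is not the authors’ model (theirs tracks a belief about an advisor’s angular error): each observation moves the belief by a gain k = var / (var + obs_var), which acts as a learning rate. Widening the variance before each update (the hypothetical inflate parameter, a rough analogue of adding noise) stops the gain from shrinking toward zero.

```python
def gaussian_gains(n_obs, prior_var, obs_var, inflate=0.0):
    """Effective learning rate (gain) on each of n_obs updates of a Gaussian belief.

    Illustrative analogy only: before each update the belief variance is widened
    by `inflate` (a hypothetical stand-in for noise addition); the gain is then
    k = var / (var + obs_var), and the posterior variance shrinks to (1 - k) * var.
    A larger gain means a new observation moves the belief further.
    """
    var = prior_var
    gains = []
    for _ in range(n_obs):
        var += inflate             # widen the belief before seeing new data
        k = var / (var + obs_var)  # weight given to the new observation
        gains.append(k)
        var = (1.0 - k) * var      # posterior variance after the update
    return gains


exact = gaussian_gains(5, prior_var=1.0, obs_var=1.0)               # gains shrink
noisy = gaussian_gains(5, prior_var=1.0, obs_var=1.0, inflate=0.5)  # gains stay higher
```

      Without inflation the gain on the n-th update is 1/(n + 1) (0.5, 0.33, 0.25, ...), whereas with inflation it settles near a constant, so later observations keep moving the belief.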

      We modified the original Bayesian model (Figure 2d, Figure 2 – figure supplement 2) by adding a uniform distribution (equivalent to our prior distribution) to each belief update; we refer to this as noise addition to the Bayesian model14,21. We varied the amount of noise δ between 0 and 1, where δ = 0 corresponds to the original Bayesian model and δ = 1 represents a very noisy Bayesian model. The uniform distribution was selected to match the first prior belief before any observation was made (equation 2). This range of δ resulted in a continuous increase of subjective uncertainty around the belief about the angular error (Figure 3b-c). The modified posterior distribution, denoted 𝑝′(σ|x), was derived at each trial as follows:

      We applied each noisy Bayesian model to participants’ choices within the social and non-social conditions.

      The addition of a uniform distribution changed two key features of the belief distribution: first, the width of the distribution remains larger with additional observations, thereby making it possible to integrate new observations more flexibly. To show this more clearly, we extracted the model-derived uncertainty estimate across multiple encounters of the same predictor for the original model and the fully noisy Bayesian model (Figure 3 – figure supplement 1). The model-derived ‘uncertainty estimate’ of a noisy Bayesian model decays more slowly compared to the ‘uncertainty estimate’ of the original Bayesian model (upper panel). Second, the model-derived ‘accuracy estimate’ reflects more recent observations in a noisy Bayesian model compared to the ‘accuracy estimate’ derived from the original Bayesian model, which integrates past observations more strongly (lower panel). Hence, as mentioned beforehand, a rapid decay of uncertainty implies a small learning index; or in other words, stronger integration of past compared to recent observations.
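      The two effects described above can be reproduced qualitatively with a small grid-based simulation. The Python sketch below rests on assumptions of our own, not the equations of the paper: observed angular errors are treated as half-normal with an unknown scale σ, the first prior over σ is uniform on a grid from 1 to 90 degrees, and noise is added by mixing the exact posterior with that uniform distribution, p′(σ|x) ∝ p(σ|x) + δ·U(σ). The posterior mean serves as the accuracy estimate and the posterior standard deviation as the uncertainty estimate.

```python
import math


def make_grid(lo=1.0, hi=90.0, n=90):
    """Candidate values of the advisor's error scale sigma, in degrees."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def normalize(p):
    total = sum(p)
    return [v / total for v in p]


def noisy_update(belief, grid, x, delta):
    """One belief update after observing angular error x (degrees).

    Assumptions (not taken from the source): x is half-normal with scale sigma,
    and noise is added by mixing the exact posterior with a uniform distribution
    weighted by delta, i.e. p'(sigma|x) proportional to p(sigma|x) + delta * U(sigma).
    delta = 0 gives the exact Bayesian update.
    """
    likelihood = [math.exp(-x * x / (2.0 * s * s)) / s for s in grid]
    posterior = normalize([b * l for b, l in zip(belief, likelihood)])
    u = 1.0 / len(grid)
    return normalize([p + delta * u for p in posterior])


def mean_and_sd(belief, grid):
    """Point estimate (posterior mean) and uncertainty (posterior SD) of sigma."""
    m = sum(b * s for b, s in zip(belief, grid))
    var = sum(b * (s - m) ** 2 for b, s in zip(belief, grid))
    return m, math.sqrt(var)


def simulate(observations, delta):
    """Track (estimate, uncertainty) across repeated encounters with one predictor."""
    grid = make_grid()
    belief = [1.0 / len(grid)] * len(grid)  # uniform first prior
    history = []
    for x in observations:
        belief = noisy_update(belief, grid, x, delta)
        history.append(mean_and_sd(belief, grid))
    return history


obs = [20.0] * 10
exact = simulate(obs, delta=0.0)
noisy = simulate(obs, delta=1.0)
```

      With repeated observations of a 20-degree error, the uncertainty estimate of the exact model (δ = 0) shrinks across encounters, while the noisy model (δ = 1) keeps a much wider belief, so later observations retain more influence. The uniform component also pulls the noisy model’s point estimate toward the middle of the grid, a side effect of this particular assumed form.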

      In the following analyses, we tested whether an increasingly noisy Bayesian model mimics behaviour that is observed in the non-social compared to social condition. For example, we tested whether an increasingly noisy Bayesian model also exhibits a strongly negative ‘predictor uncertainty’ effect on interval setting (Figure 2e). In such a way, we can test whether differences in noise in the updating process of a Bayesian model might reproduce important qualitative differences in learning-related behaviour seen in the social and nonsocial conditions.

      We used these modified Bayesian models to simulate trial-wise interval setting for each participant according to the observations they made when selecting a particular advisor or non-social cue. We simulated interval setting at each trial and examined whether an increase in noise produced model behaviours that resembled participant behaviour patterns observed in the non-social condition as opposed to the social condition. At each trial, we used the accuracy estimate (Methods, equation 6), which represents a subjective belief about a single angular error, to derive an interval setting for the selected predictor. To do so, we first derived the point estimate of the belief distribution at each trial (Methods, equation 6) and multiplied it by the size of one interval step on the circle. The step size was derived by dividing the circle size by the maximum number of possible steps. For example, assume that the belief about the angular error on the current trial is 50 degrees (Methods, equation 6) and that we want to transform this number into an interval for the current predictor. To obtain the size of one interval step, the circle size (360 degrees) is divided by the maximum number of interval steps (40 steps; note, 20 steps on each side), which gives nine degrees per interval step. Next, the accuracy estimate in radians (0.87) is multiplied by the step size in radians (0.1571), resulting in an interval of 0.137 radians or 7.85 degrees. The final interval size would therefore be 7.85 degrees.
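      The worked example can be reproduced in a few lines of Python. The sketch follows the arithmetic exactly as described (expressing both the accuracy estimate and the step size in radians, multiplying them, and reporting the product in degrees); the constant and function names are our own.

```python
import math

CIRCLE_DEG = 360.0
MAX_STEPS = 40  # 20 interval steps on each side of the prediction


def accuracy_to_interval_deg(accuracy_deg):
    """Convert an accuracy estimate (degrees) into an interval size (degrees),
    following the worked example: both quantities are expressed in radians,
    multiplied, and the product is reported in degrees."""
    step_deg = CIRCLE_DEG / MAX_STEPS  # 9 degrees per interval step
    interval_rad = math.radians(accuracy_deg) * math.radians(step_deg)
    return math.degrees(interval_rad)


interval = accuracy_to_interval_deg(50.0)  # about 7.85 degrees, as in the example
```

      Because the product of the two radian values is converted back with a single factor of 180/π, the result equals accuracy × step × π/180, i.e. 50 × 9 × π/180 ≈ 7.85 for the example above.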

      Simulating Bayesian choices in that way, we repeated the behavioural analyses (Figure 2b,e,f) to test whether intervals derived from noisier Bayesian models mimic intervals set by participants in the non-social condition: greater changes in interval setting across trials (Figure 3 – figure supplement 1b), a negative ‘predictor uncertainty’ effect on interval setting (Figure 3d), and a higher learning index (Figure 3 – figure supplement 1c).

      First, we repeated the most crucial analysis, the linear regression analysis (Figure 2e), and hypothesized that intervals simulated from noisy Bayesian models would also show a greater negative ‘predictor uncertainty’ effect on interval setting. This was indeed the case: irrespective of social or non-social conditions, the addition of noise (increased weighting of the uniform distribution in each belief update) led to an increasingly negative ‘predictor uncertainty’ effect on confidence judgment (new Figure 3d). In Figure 3d, we show the regression weights (y-axis) for the ‘predictor uncertainty’ on confidence judgment with increasing noise (x-axis). This result is highly consistent with the idea that, in the non-social condition, task estimates are updated in a more uncertain and noisier manner. By contrast, social estimates appear relatively more stable, also according to this new Bayesian simulation analysis.

      This new finding extends the results and suggests a formal computational account of the behavioural differences between social and non-social conditions. Increasing the noise of the belief update mimics behaviour that is observed in the non-social condition: an increasingly negative effect of ‘predictor uncertainty’ on confidence judgment. Notably, there was no difference in the impact that the noise had in the social and non-social conditions. This was expected because the Bayesian simulations are blind to the framing of the conditions. However, it means that the observed effects do not depend on the precise sequence of choices that participants made in these conditions. It therefore suggests that an increase in the Bayesian noise leads to an increasingly negative impact of ‘predictor uncertainty’ on confidence judgments irrespective of the condition. Hence, we can conclude that different degrees of uncertainty within the belief update are a reasonable explanation for the differences observed between social and non-social conditions.

      Next, we used these simulated confidence intervals and repeated the descriptive behavioural analyses to test whether interval settings derived from noisier Bayesian models mimic behavioural patterns observed in non-social compared to social conditions. For example, more noise in the belief update should lead to more flexible integration of new information and hence potentially to a greater change of confidence judgments across predictor encounters (Figure 2b). Further, a greater reliance on recent information should cause prediction errors to be reflected more strongly in the next confidence judgment; hence, it should result in a higher learning index in the non-social condition, which we hypothesize to be perceived as more uncertain (Figure 2f). We used the simulated confidence intervals from Bayesian models on a continuum of noise integration (i.e. different weightings of the uniform distribution in the belief update) and again derived both absolute confidence change and learning indices (Figure 3 – figure supplement 1b-c).

      ‘Absolute confidence change’ and ‘learning index’ increase with increasing noise weight, thereby mimicking the difference between social and non-social conditions. Further, these analyses demonstrate the tight relationship between descriptive analyses and model-based analyses. They show that noise in the Bayesian updating process is a conceptual explanation that can account for both the differences in learning and the difference in uncertainty processing that exist between social and non-social conditions. The key insight conveyed by the Bayesian simulations is that a wider, more uncertain belief distribution changes more quickly. Correspondingly, in the non-social condition, participants express more uncertainty in their confidence estimate when they set the interval, and they also change their beliefs more quickly, as expressed in a higher learning index. Therefore, noisy Bayesian updating can account for key differences between social and non-social conditions.

      We thank the reviewer for making this point, as we believe that these additional analyses allow theoretical inferences to be made in a more direct manner; we think that it has significantly contributed towards a deeper understanding of the mechanisms involved in the social and non-social conditions. Further, it provides a novel account of how we make judgments when being presented with social and non-social information.

      We made substantial changes to the main text, figures and supplementary material to include these changes:

      Main text, page 10-11 new section:

      The impact of noise in belief updating in social and non-social conditions

      So far, we have shown that, in comparison to non-social predictors, participants changed their interval settings about social advisors less drastically across time, relied on observations made further in the past, and were less impacted by their subjective uncertainty when they did so (Figure 2). Using Bayesian simulation analyses, we investigated whether a common mechanism might underlie these behavioural differences. We tested whether the integration of new evidence differed between social and non-social conditions; for example, recent observations might be weighted more strongly for non-social cues while past observations might be weighted more strongly for social cues. Similar ideas were tested in previous studies, when comparing the learning rate (i.e. the speed of learning) in environments of different volatilities12,13. We tested these ideas using established ways of changing the speed of learning during Bayesian updates14,21. We hypothesized that participants reduce their uncertainty more quickly when observing social information. Conversely, we hypothesized a less steep decline of uncertainty when observing non-social information, indicating that new information can be flexibly integrated during the belief update (Figure 3a).

      We manipulated the amount of uncertainty in the Bayesian model by adding a uniform distribution to each belief update (Figure 3b-c) (equations 10 and 11). Consequently, the distribution’s width increases and is more strongly impacted by recent observations (see example in Figure 3 – figure supplement 1). We used these modified Bayesian models to simulate trial-wise interval setting for each participant according to the observations they made by selecting a particular advisor in the social condition or other predictor in the non-social condition. We simulated confidence intervals at each trial. We then used these to examine whether an increase in noise led to simulated behaviour that resembled the behavioural patterns observed in the non-social condition, which differed from those observed in the social condition.

      First, we repeated the linear regression analysis and hypothesized that interval settings simulated from noisy Bayesian models would also show a greater negative ‘predictor uncertainty’ effect on interval setting, resembling the effect we had observed in the non-social condition (Figure 2e). This was indeed the case when using the noisy Bayesian model: irrespective of social or non-social condition, the addition of noise (increasing weight of the uniform distribution in each belief update) led to an increasingly negative ‘predictor uncertainty’ effect on confidence judgment (new Figure 3d). The absence of a difference between the social and non-social conditions in the simulations suggests that an increase in the Bayesian noise is sufficient to induce a negative impact of ‘predictor uncertainty’ on interval setting. Hence, we can conclude that different degrees of noise in the updating process are sufficient to produce the differences observed between social and non-social conditions. Next, we used these simulated interval settings and repeated the descriptive behavioural analyses (Figure 2b,f). An increase in noise led to greater changes of confidence across time and a higher learning index (Figure 3 – figure supplement 1b-c). In summary, the Bayesian simulations offer a conceptual explanation that can account for both the differences in learning and the difference in uncertainty processing that exist between social and non-social conditions. The key insight conveyed by the Bayesian simulations is that a wider, more uncertain belief distribution changes more quickly. Correspondingly, in the non-social condition, participants express more uncertainty in their confidence estimate when they set the interval, and they also change their beliefs more quickly. Therefore, noisy Bayesian updating can account for key differences between social and non-social conditions.

      Methods, page 23 new section:

      Extension of Bayesian model with varying amounts of noise

      We modified the original Bayesian model (Figure 2d, Figure 2 – figure supplement 2) to test whether the integration of new evidence differed between social and non-social conditions; for example, recent observations might be weighted more strongly for non-social cues while past observations might be weighted more strongly for social cues. [...] To obtain the size of one interval step, the circle size (360 degrees) is divided by the maximum number of interval steps (40 steps; note, 20 steps on each side), which gives nine degrees per interval step. Next, the accuracy estimate in radians (0.87) is multiplied by the step size in radians (0.1571), resulting in an interval of 0.137 radians or 7.85 degrees. The final interval size would therefore be 7.85 degrees.

      We repeated the behavioural analyses (Figure 2b,e,f) to test whether confidence intervals derived from noisier Bayesian models mimic the behavioural patterns observed in the non-social condition: greater changes of confidence across trials (Figure 3 – figure supplement 1b), a more negative ‘predictor uncertainty’ effect on confidence judgment (Figure 3 – figure supplement 1c) and a greater learning index (Figure 3d).

      Discussion, page 14: […] It may be because we make just such assumptions that past observations are used to predict the performance levels that people are likely to exhibit next 15,16. An alternative explanation might be that participants experience a steeper decline in subjective uncertainty in their beliefs about the accuracy of social advice, resulting in a narrower prior distribution during the next encounter with the same advisor. We used a series of simulations to investigate how uncertainty about beliefs changed from trial to trial and showed that belief updates about non-social cues were consistent with a noisier update process that diminished the impact of experiences over the longer term. From a Bayesian perspective, greater certainty about the value of advice means that contradictory evidence will need to be stronger to alter one’s beliefs. In the absence of such evidence, a Bayesian agent is more likely to repeat previous judgments. As with confirmation bias 17, such a perspective suggests that once we are more certain about others’ features, for example their character traits, we are less likely to change our opinions about them.
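      The Bayesian point that greater certainty requires stronger contradictory evidence can be made concrete with a standard conjugate Gaussian update, used here purely as an illustrative stand-in for the study's model. The belief and observation variances and the example numbers are assumptions. The weight given to a new observation (the gain) is the prior variance divided by the sum of prior and observation variance, so a narrower prior moves less for the same contradictory observation, and a larger deviation is needed to shift it by a given amount.

```python
def gaussian_update(prior_mean, prior_var, observation, obs_var):
    """Conjugate Gaussian update of a belief, e.g. about an advisor's accuracy."""
    gain = prior_var / (prior_var + obs_var)          # weight given to the new evidence
    post_mean = prior_mean + gain * (observation - prior_mean)
    post_var = (1.0 - gain) * prior_var               # uncertainty shrinks after each update
    return post_mean, post_var, gain


def evidence_needed(prior_var, obs_var, desired_shift):
    """How far an observation must deviate from the current belief to move it by desired_shift."""
    gain = prior_var / (prior_var + obs_var)
    return desired_shift / gain


# Same contradictory observation (0.4 against a belief of 0.8), different prior certainty
certain = gaussian_update(0.8, 0.01, 0.4, 0.04)
uncertain = gaussian_update(0.8, 0.09, 0.4, 0.04)
print(round(certain[0], 3), round(uncertain[0], 3))
# prints: 0.72 0.523
```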

      Reviewer #2 (Public Review):

      Humans learn about the world both directly, by interacting with it, and indirectly, by gathering information from others. There has been a longstanding debate about the extent to which social learning relies on specialized mechanisms that are distinct from those that support learning through direct interaction with the environment. In this work, the authors approach this question using an elegant within-subjects design that enables direct comparisons between how participants use information from social and non-social sources. Although the information presented in both conditions had the same underlying structure, participants tracked the performance of the social cue more accurately and changed their estimates less as a function of prediction error. Further, univariate activity in two regions, dmPFC and pTPJ, tracked participants' confidence judgments more closely in the social than in the non-social condition, and multivariate patterns of activation in these regions contained information about the identity of the social cues.

      Overall, the experimental approach and model used in this paper are very promising. However, after reading the paper, I found myself wanting additional insight into what these condition differences mean, and how to place this work in the context of prior literature on this debate. In addition, some additional analyses would be useful to support the key claims of the paper.

      We thank the reviewer for their very supportive comments. We have addressed their points below and have highlighted changes in our manuscript that we made in response to the reviewer’s comments.

      (1) The framing should be reworked to place this work in the context of prior computational work on social learning. Some potentially relevant examples:

      • Shafto, Goodman & Frank (2012) provide a computational account of the domain-specific inductive biases that support social learning. In brief, what makes social learning special is that we have an intuitive theory of how other people's unobservable mental states lead to their observable actions, and we use this intuitive theory to actively interpret social information. (There is also a wealth of behavioral evidence in children to support this account; for a review, see Gweon, 2021).

      • Heyes (2012) provides a leaner account, arguing that social and non-social learning are supported by a common associative learning mechanism, and what distinguishes social from non-social learning is the input mechanism. Social learning becomes distinctively "social" to the extent that organisms are biased or attuned to social information.

      I highlight these papers because they go a step beyond asking whether there is any difference between mechanisms that support social and non-social learning; they also provide concrete proposals about what that difference might be, and what might be shared. I would like to see this work move in a similar direction.

      References
      (In the interest of transparency: I am not an author on these papers.)

      Gweon, H. (2021). Inferential social learning: how humans learn from others and help others learn. PsyArXiv. https://doi.org/10.31234/osf.io/8n34t

      Heyes, C. (2012). What's social about social learning?. Journal of Comparative Psychology, 126(2), 193.

      Shafto, P., Goodman, N. D., & Frank, M. C. (2012). Learning from others: The consequences of psychological reasoning for human learning. Perspectives on Psychological Science, 7(4), 341-351.

      Thank you for this suggestion to expand our framing. We have now made substantial changes to the Introduction and Discussion to include additional background literature on the differences between social and non-social learning, including the references suggested by the reviewer. We further related our findings to other discussions in the literature arguing that differences between social and non-social learning might occur at the level of algorithms (the computations involved in social and non-social learning) and/or implementation (the neural mechanisms). Here, we describe behaviour with the same algorithm (a Bayesian model), but the weighting of uncertainty in decision-making differs between social and non-social contexts. This might be explained by ideas put forward by Shafto and colleagues (2012), who suggest that differences between social and non-social learning might be due to the attribution of goal-directed intentions to social agents, but not to non-social cues. Such an attribution might lead participants to assume that advisor performance will be relatively stable, because advisors should have relatively stable goal-directed intentions. We also show differences at the implementational level between social and non-social learning in TPJ and dmPFC.

      Below we list the changes we have made to the Introduction and Discussion. Further, we would also like to emphasize the substantial extension of the Bayesian modelling which we think clarifies the theoretical framework used to explain the mechanisms involved in social and non-social learning (see our answer to the next comments below).

      Introduction, page 4:

      [...]
      Therefore, by comparing information sampling from social versus non-social sources, we address a long-standing question in cognitive neuroscience: the degree to which any neural process is specialized for, or particularly linked to, social as opposed to non-social cognition 2–9. Given their similarities, it is expected that both types of learning will depend on common neural mechanisms. However, given the importance and ubiquity of social learning, it may also be that the neural mechanisms that support learning from social advice are at least partially specialized and distinct from those concerned with learning that is guided by non-social sources.

      However, it is less clear at which level information is processed differently when it has a social or non-social origin. It has recently been argued that differences between social and non-social learning can be investigated at different levels of Marr’s information processing theory: differences could emerge at an input level (in terms of the stimuli that might drive social and non-social learning), at an algorithmic level or at a neural implementation level 7. It might be that, at the algorithmic level, associative learning mechanisms are similar across social and non-social learning 1. Other theories have argued that differences might emerge because goal-directed actions are attributed to social agents, which allows very different inferences to be made about hidden traits or beliefs 10. Such inferences might fundamentally alter learning about social agents compared to non-social cues.

      Discussion, page 15:

      […] One potential explanation for the assumption of stable performance for social but not non-social predictors might be that participants attribute intentions and motivations to social agents. Even if the social and non-social evidence is the same, the belief that a social actor might have a goal may affect the inferences made from the same piece of information 10. Social advisors first learnt about the target’s distribution and accordingly gave advice on where to find the target. If the social agents are credited with goal-directed behaviour, then it might be assumed that their goals remain relatively constant; this might lead participants to assume stability in the performance of social advisors. However, such goal-directed intentions might not be attributed to non-social cues, thereby making judgments about them inherently more uncertain and changeable across time. Such an account, focussing on differences in attribution in social settings, aligns with a recent suggestion that similarities or differences between social and non-social processes can be identified at any one of several levels of Marr’s information processing theory 7. Here we found that the same algorithm was able to explain social and non-social learning (a qualitatively similar computational model could explain both). However, the extent to which the algorithm was recruited when learning about social compared to non-social information differed. We observed a greater impact of uncertainty on judgments about social compared to non-social information. We have shown evidence for a degree of specialization when assessing social advisors as opposed to non-social cues. At the neural level we focused on two brain areas, dmPFC and pTPJ, that have not only been shown to carry signals associated with belief inferences about others but, in addition, recent combined fMRI-TMS studies have demonstrated the causal importance of these activity patterns for the inference process […]

      (2) The results imply that dmPFC and pTPJ differentiate between learning from social and non-social sources. However, more work needs to be done to rule out simpler, deflationary accounts. In particular, the condition differences observed in dmPFC and pTPJ might reflect low-level differences between the two conditions. For example, the social task could simply have been more engaging to participants, or the social predictors may have been more visually distinct from one another than the fruits.

      We understand the reviewer’s concern regarding low-level distinctions between the social and non-social conditions that could confound the differences in neural activation observed between conditions in pTPJ and dmPFC. From the reviewer’s comments, we understand that there might be two potential confounds. First, low-level differences, such that stimuli within one condition might be more distinct from each other than stimuli within the other condition. In that case, the greater visual distinctiveness of stimuli in one condition alone might lead to learning differences between conditions. Second, stimuli in one condition might be more engaging and potentially lead to attentional differences between conditions. We used a combination of univariate and multivariate analyses to address both concerns.

      Analysis 1: Univariate analysis to inspect potential unaccounted variance between social and non-social condition

      First, we used the existing univariate analysis (exploratory MRI whole-brain analysis, see Methods) to test for neural activation that covaried with attentional differences (or any other unaccounted-for neural difference) between conditions. If there were neural differences between conditions that we are currently not accounting for with the parametric regressors included in the fMRI-GLM, then these differences should be captured in the constant of the GLM. For example, if there were attentional differences between conditions, then we would expect to see neural differences between conditions in areas such as the inferior parietal lobe (or other areas commonly engaged during attentional processes).

      Importantly, the constant of the GLM should capture any unaccounted-for differences, whether they are due to attention or to alternative processes that might differ between conditions. When inspecting cluster-corrected differences in the constant of the fMRI-GLM during the setting of the confidence judgment, there was no cluster-significant activation that differed between social and non-social conditions (Figure 4 – figure supplement 4a; results were familywise-error cluster-corrected at p<0.05 using a cluster-defining threshold of z>2.3). For transparency, we show the sub-threshold activation map across the whole brain (z > 2) for the ‘constant’ contrasted between social and non-social conditions (i.e. constant, contrast: social – non-social).
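      The logic of inspecting the GLM constant can be sketched with a toy one-regressor model. This is a simplified stand-in for the fMRI-GLM, not the analysis pipeline used: the signals are synthetic and noise-free, and the 0.3 offset stands for a hypothetical unmodelled process (for example, extra attention) present in one condition only. Because the offset does not vary with the parametric regressor, ordinary least squares assigns it entirely to the constant and leaves the slope unchanged.

```python
def fit_glm(regressor, signal):
    """Ordinary least squares fit of signal = b0 + b1 * regressor (constant plus one parametric regressor)."""
    n = len(signal)
    mean_x = sum(regressor) / n
    mean_y = sum(signal) / n
    sxx = sum((x - mean_x) ** 2 for x in regressor)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(regressor, signal))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Toy, noise-free signals: identical dependence on confidence in both conditions,
# plus a hypothetical unmodelled offset of 0.3 in the non-social condition only.
confidence = [1, 2, 3, 4, 5, 6]
social = [1.0 + 0.5 * c for c in confidence]
non_social = [1.3 + 0.5 * c for c in confidence]

b0_s, b1_s = fit_glm(confidence, social)
b0_n, b1_n = fit_glm(confidence, non_social)
print(round(b0_n - b0_s, 3))  # the offset lands in the constant
# prints: 0.3
```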

      For transparency, we additionally used an ROI approach to test for differences in activation patterns that correlated with the constant during the confidence phase; that is, we used the same ROI approach as in the paper to avoid any biased test selection. We compared activation patterns between social and non-social conditions in the same ROIs as before, dmPFC (MNI coordinate [x/y/z: 2,44,36] 16) and bilateral pTPJ (70% probability anatomical mask; for reference see manuscript, page 23), and additionally compared activation patterns between conditions in bilateral IPLD (50% probability anatomical mask, 20). We did not find significantly different activation patterns between social and non-social conditions in any of these areas: dmPFC (confidence constant; paired t-test social vs non-social: t(23) = 0.06, p=0.96, [-36.7, 38.75]), bilateral TPJ (confidence constant; paired t-test social vs non-social: t(23) = -0.06, p=0.95, [-31, 29]), bilateral IPLD (confidence constant; paired t-test social vs non-social: t(23) = -0.58, p=0.57, [-30.3, 17.1]).
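      The paired comparisons reported above (t(23), p and an interval for the mean difference) follow the standard paired t-test. A minimal standard-library sketch is shown below. It is not the authors' code, the toy values are not study data, and the critical t value is passed in rather than computed (for 23 degrees of freedom and a two-sided 95% interval it is about 2.069). The sketch assumes that the bracketed intervals in the text are confidence intervals of the social minus non-social difference.

```python
import math
from statistics import mean, stdev


def paired_t_test(a, b, t_crit):
    """Paired t-test on per-participant values a and b.

    t_crit is the two-sided critical t value for n - 1 degrees of freedom,
    passed in so that only the standard library is needed.
    Returns (t, df, (ci_low, ci_high)) for the mean difference a - b.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    d_mean = mean(diffs)
    se = stdev(diffs) / math.sqrt(n)  # standard error of the mean difference
    t = d_mean / se
    half_width = t_crit * se
    return t, n - 1, (d_mean - half_width, d_mean + half_width)


# Toy example with 4 participants (df = 3, two-sided 95% critical value about 3.182)
t, df, (lo, hi) = paired_t_test([3.0, 4.0, 5.0, 6.0], [2.0, 2.0, 2.0, 2.0], 3.182)
print(f"t({df}) = {t:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```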

      There were no meaningful activation patterns that differed between conditions either in areas commonly linked to attention (e.g. IPL) or in the brain areas that were the focus of the study (dmPFC and pTPJ). Activation in dmPFC and pTPJ covaried with parametric effects such as the confidence that was set on the current and previous trials, and did not correlate with low-level differences such as attention. Hence, these results suggest that differences in activation between conditions were better captured by parametric regressors such as the trial-wise interval setting, i.e. confidence, and are unlikely to be confounded by low-level processes that can be captured with univariate neural analyses.

      Analysis 2: RSA to test visual distinctiveness between social and non-social conditions

      We addressed the reviewer’s other comment more directly by testing whether differences between conditions might arise from a different degree of visual distinctiveness in one stimulus set compared to the other. We used RSA to inspect potential differences in early visual processes that should be affected by greater stimulus similarity within one condition. In other words, we tested whether the visual distinctiveness of one stimulus set differed from the visual distinctiveness of the other stimulus set. To do so, we compared the Exemplar Discriminability Index (EDI) between conditions in early visual areas. We compared the dissimilarity of neural activation related to the presentation of an identical stimulus across trials (diagonal of the RSA matrix) with the dissimilarity of neural activation between different stimuli across trials (off-diagonal of the RSA matrix). If stimuli within one stimulus set are very similar, then the difference between the diagonal and off-diagonal should be very small and less likely to be significant (i.e. similar diagonal and off-diagonal values). In contrast, if stimuli within one set are very distinct from each other, then the difference between the diagonal and off-diagonal should be large and likely to result in a significant EDI (i.e. different diagonal and off-diagonal values) (see Figure 4g for a schematic illustration). Hence, a difference in visual distinctiveness between social and non-social conditions should result in different EDI values for the two conditions, so visual distinctiveness can be tested by comparing EDI values between conditions in early visual cortex. We used a Harvard-cortical ROI mask based on bilateral V1. Negative EDI values indicate that the same exemplars are represented more similarly in the V1 neural pattern than different exemplars.
      This analysis showed that there was no significant difference in EDI between conditions (Figure 4 – figure supplement 4b; EDI paired-sample t-test: t(23) = -0.16, p=0.87, 95% CI [-6.7, 5.7]).
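      The EDI computation described above can be sketched as follows. The matrix layout (exemplars from one data partition in rows, from an independent partition in columns) and the toy values are assumptions; the sign convention follows the text, so negative EDI values indicate that identical exemplars evoke more similar patterns than different exemplars. In the analysis described above, one EDI per participant and condition would then enter a paired t-test between the social and non-social conditions.

```python
def exemplar_discriminability(dissim):
    """EDI from an exemplar-by-exemplar dissimilarity matrix.

    dissim[i][j] is the dissimilarity between the pattern for exemplar i in one
    data partition and exemplar j in an independent partition (assumed layout).
    EDI = mean(diagonal) - mean(off-diagonal), so negative values mean that
    identical exemplars are represented more similarly than different ones.
    """
    k = len(dissim)
    diagonal = [dissim[i][i] for i in range(k)]
    off_diagonal = [dissim[i][j] for i in range(k) for j in range(k) if i != j]
    return sum(diagonal) / len(diagonal) - sum(off_diagonal) / len(off_diagonal)


# Toy 3-exemplar matrix (illustrative values only, not study data)
toy = [[0.2, 0.8, 0.9],
       [0.7, 0.1, 0.8],
       [0.9, 0.8, 0.3]]
print(round(exemplar_discriminability(toy), 3))
# prints: -0.617
```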

      We have further replicated results in V1 with a whole-brain searchlight analysis, averaging across both social and non-social conditions.

      In summary, by using a combination of univariate and multivariate analyses, we tested whether neural activation might differ when participants were presented with face or fruit stimuli, and whether such differences might confound the observed learning differences between conditions. We did not find meaningful neural differences that were not accounted for by the regressors included in the GLM. Further, we did not find differences in visual distinctiveness between the stimulus sets. Hence, these control analyses suggest that differences between social and non-social conditions are unlikely to arise from differences in low-level processes and are instead more likely to develop when learning about social or non-social information.

      Moreover, we also examined behaviourally whether participants differed in the way they approached the social and non-social conditions. We tested whether there were initial biases prior to learning, i.e. before participants actually received information from either social or non-social sources. Specifically, we tested whether participants had different prior expectations about the performance of social compared to non-social predictors by comparing the confidence judgments on the first trial of each predictor. Participants set confidence intervals very similarly in the social and non-social conditions (Figure below). Hence, differences between conditions did not seem to arise from low-level differences in the stimulus sets or from prior differences in expectations about the performance of social compared to non-social predictors. Instead, differences between conditions become apparent when participants update their beliefs about social advisors or non-social cues and, as a consequence, in the way that confidence judgments are set across time.

      Figure. Confidence interval for the first encounter of each predictor in social and non-social conditions. There was no initial bias in predicting the performance of social or non-social predictors.

      Main text page 13:

      […]
      Additional control analyses show that neural differences between social and non-social conditions were not due to the visually different sets of stimuli used in the experiment but instead represent fundamental differences in processing social compared to non-social information (Figure 4 – figure supplement 4). These results are shown in both an ROI-based RSA analysis and a whole-brain searchlight analysis. Taken together, the univariate and multivariate analyses demonstrate that dmPFC and pTPJ represent beliefs about social advisors that develop over a longer timescale and encode the identities of the social advisors.

      References

      1. Heyes, C. (2012). What’s social about social learning? Journal of Comparative Psychology 126, 193–202. 10.1037/a0025180.
      2. Chang, S.W.C., and Dal Monte, O. (2018). Shining Light on Social Learning Circuits. Trends in Cognitive Sciences 22, 673–675. 10.1016/j.tics.2018.05.002.
      3. Diaconescu, A.O., Mathys, C., Weber, L.A.E., Kasper, L., Mauer, J., and Stephan, K.E. (2017). Hierarchical prediction errors in midbrain and septum during social learning. Soc Cogn Affect Neurosci 12, 618–634. 10.1093/scan/nsw171.
      4. Frith, C., and Frith, U. (2010). Learning from Others: Introduction to the Special Review Series on Social Neuroscience. Neuron 65, 739–743. 10.1016/j.neuron.2010.03.015.
      5. Frith, C.D., and Frith, U. (2012). Mechanisms of Social Cognition. Annu. Rev. Psychol. 63, 287–313. 10.1146/annurev-psych-120710-100449.
      6. Grabenhorst, F., and Schultz, W. (2021). Functions of primate amygdala neurons in economic decisions and social decision simulation. Behavioural Brain Research 409, 113318. 10.1016/j.bbr.2021.113318.
      7. Lockwood, P.L., Apps, M.A.J., and Chang, S.W.C. (2020). Is There a ‘Social’ Brain? Implementations and Algorithms. Trends in Cognitive Sciences, S1364661320301686. 10.1016/j.tics.2020.06.011.
      8. Soutschek, A., Ruff, C.C., Strombach, T., Kalenscher, T., and Tobler, P.N. (2016). Brain stimulation reveals crucial role of overcoming self-centeredness in self-control. Sci. Adv. 2, e1600992. 10.1126/sciadv.1600992.
      9. Wittmann, M.K., Lockwood, P.L., and Rushworth, M.F.S. (2018). Neural Mechanisms of Social Cognition in Primates. Annu. Rev. Neurosci. 41, 99–118. 10.1146/annurev-neuro-080317-061450.
      10. Shafto, P., Goodman, N.D., and Frank, M.C. (2012). Learning From Others: The Consequences of Psychological Reasoning for Human Learning. Perspect Psychol Sci 7, 341–351. 10.1177/1745691612448481.
      11. McGuire, J.T., Nassar, M.R., Gold, J.I., and Kable, J.W. (2014). Functionally Dissociable Influences on Learning Rate in a Dynamic Environment. Neuron 84, 870–881. 10.1016/j.neuron.2014.10.013.
      12. Behrens, T.E.J., Woolrich, M.W., Walton, M.E., and Rushworth, M.F.S. (2007). Learning the value of information in an uncertain world. Nature Neuroscience 10, 1214–1221. 10.1038/nn1954.
      13. Meder, D., Kolling, N., Verhagen, L., Wittmann, M.K., Scholl, J., Madsen, K.H., Hulme, O.J., Behrens, T.E.J., and Rushworth, M.F.S. (2017). Simultaneous representation of a spectrum of dynamically changing value estimates during decision making. Nat Commun 8, 1942. 10.1038/s41467-017-02169-w.
      14. Allenmark, F., Müller, H.J., and Shi, Z. (2018). Inter-trial effects in visual pop-out search: Factorial comparison of Bayesian updating models. PLoS Comput Biol 14, e1006328. 10.1371/journal.pcbi.1006328.
      15. Wittmann, M., Trudel, N., Trier, H.A., Klein-Flügge, M., Sel, A., Verhagen, L., and Rushworth, M.F.S. (2021). Causal manipulation of self-other mergence in the dorsomedial prefrontal cortex. Neuron.
      16. Wittmann, M.K., Kolling, N., Faber, N.S., Scholl, J., Nelissen, N., and Rushworth, M.F.S. (2016). Self-Other Mergence in the Frontal Cortex during Cooperation and Competition. Neuron 91, 482–493. 10.1016/j.neuron.2016.06.022.
      17. Kappes, A., Harvey, A.H., Lohrenz, T., Montague, P.R., and Sharot, T. (2020). Confirmation bias in the utilization of others’ opinion strength. Nat Neurosci 23, 130–137. 10.1038/s41593-019-0549-2.
      18. Trudel, N., Scholl, J., Klein-Flügge, M.C., Fouragnan, E., Tankelevitch, L., Wittmann, M.K., and Rushworth, M.F.S. (2021). Polarity of uncertainty representation during exploration and exploitation in ventromedial prefrontal cortex. Nat Hum Behav. 10.1038/s41562-020-0929-3.
      19. Yu, Z., Guindani, M., Grieco, S.F., Chen, L., Holmes, T.C., and Xu, X. (2022). Beyond t test and ANOVA: applications of mixed-effects models for more rigorous statistical analysis in neuroscience research. Neuron 110, 21–35. 10.1016/j.neuron.2021.10.030.
      20. Mars, R.B., Jbabdi, S., Sallet, J., O’Reilly, J.X., Croxson, P.L., Olivier, E., Noonan, M.P., Bergmann, C., Mitchell, A.S., Baxter, M.G., et al. (2011). Diffusion-Weighted Imaging Tractography-Based Parcellation of the Human Parietal Cortex and Comparison with Human and Macaque Resting-State Functional Connectivity. Journal of Neuroscience 31, 4087–4100. 10.1523/JNEUROSCI.5102-10.2011.
      21. Yu, A.J., and Cohen, J.D. Sequential effects: Superstition or rational behavior? 8.
      22. Nili, H., Wingfield, C., Walther, A., Su, L., Marslen-Wilson, W., and Kriegeskorte, N. (2014). A Toolbox for Representational Similarity Analysis. PLoS Comput Biol 10, e1003553. 10.1371/journal.pcbi.1003553.
      23. Lockwood, P.L., Wittmann, M.K., Nili, H., Matsumoto-Ryan, M., Abdurahman, A., Cutler, J., Husain, M., and Apps, M.A.J. (2022). Distinct neural representations for prosocial and self-benefiting effort. Current Biology 32, 4172-4185.e7. 10.1016/j.cub.2022.08.010.
    1. Author Response

      Reviewer #2 (Public Review):

      1) Although the images and videos were of great quality, the results derived from them provided little new knowledge and few conceptual insights into male reproductive tract biology and basically confirmed what has been published using traditional methods. For example, the high intensity of the vascular network in the initial segment was previously reported by Abe in 1984 and Suzuki in 1982; the pattern of the major lymphatic vessel and drainage was beautifully depicted by Perez-Clavier, 1982.

      We thank the reviewer for his/her appreciative comments regarding the quality of the images/videos we provide in this study. We do not fully agree with his/her assessment of the lack of novelty. Our work confirms earlier reports that are now dated (1980s), which in itself is worth mentioning for the interested community, especially when the confirmation uses the most advanced technologies available today. We have never said that nothing was done in the past, and we have acknowledged all past contributors (including those mentioned by the reviewer) by pointing out the limitations of the technical tools that were available at the time. In addition, our current work provides a more comprehensive and global view by extending our approach to the entire mouse epididymis, whereas previous work was much more limited.

      2) The authors were very cautious when interpreting the results of marker immunostaining; however, these markers were not specific to a definite cell type. For example, as the authors stated, VEGFR3 marks both lymphatic vessels and fenestrated blood vessels. How could the authors claim that the VEGFR3+ network was lymphatic? The authors claimed that they used three markers for the lymphatic vessels, but the staining patterns of the networks were very different. How could the authors draw conclusions about the network of lymphatic vessels in the epididymis?

      We broadly agree with the reviewer and have made it clear that one cannot be 100% sure that all the VEGFR3+ structures we present are lymphatic. However, in total, we used 4 documented lymphatic markers (not 3 as mentioned by the reviewer): VEGFR3, LYVE1, PROX1 and PDPN. Three of them give very similar profiles, while only PDPN shows some differences. We are currently studying the expression of PDPN in the mouse epididymis in more detail because we speculate that this marker may target a population of pluripotent cells in this tissue. Therefore, given the 3 similar profiles and the subtraction of PLVAP+ structures, we are reasonably confident that what we show corresponds to the different lymphatic structures.
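      The subtraction logic can be illustrated with binary masks. This is a hypothetical sketch, not the image-analysis procedure used in the study: it assumes thresholded, co-registered VEGFR3 and PLVAP channels and keeps only VEGFR3-positive pixels that are PLVAP-negative, since PLVAP marks fenestrated blood vessels that may also be VEGFR3-positive. The function name and toy masks are illustrative only.

```python
def lymphatic_candidate_mask(vegfr3_mask, plvap_mask):
    """Keep VEGFR3-positive pixels that are PLVAP-negative.

    Both inputs are hypothetical binary masks (lists of rows of booleans)
    from thresholded, co-registered channels of the same image.
    """
    return [[v and not p for v, p in zip(v_row, p_row)]
            for v_row, p_row in zip(vegfr3_mask, plvap_mask)]


vegfr3 = [[True, True, False],
          [True, False, True]]
plvap = [[False, True, False],
         [False, False, True]]
print(lymphatic_candidate_mask(vegfr3, plvap))
# prints: [[True, False, False], [True, False, False]]
```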

      3) To understand vascular network development in the epididymis, would the authors please look at the fetal stage, when the vascular network is first established? Wolffian duct tissues are much smaller and thinner and would probably be amenable to 3D imaging even without clearing.

      We generally agree with the reviewer that this could be an interesting addition. However, it represents a significant amount of additional work. Organ clearing will certainly be required because the Wolffian duct is unlikely to be sufficiently transparent to allow light-sheet microscopy. In the literature, the study of the Wolffian duct relies primarily on whole mounts, embedded sections and cryosections. Besides the fact that this represents a lot of extra work, we are not totally convinced that it would be of much use. A key reason is that the epididymis is an organ that differentiates completely after birth (Robaire and Hinton, 2015). Differentiation of mouse caput segment 1 has been reported to occur around 19 DPN (Xu et al., 2016) and is intimately related to the development of the vasculature (Lebarr et al., 1986). Regarding the lymphatic network, Svingen et al. (2012) report that lymphangiogenesis in the mouse testis and epididymis is initiated late in gestation, after 15 DPC. Videos showing the external lymphatic vessels of the testis and epididymis at 17.5 DPC can be seen at https://doi.org/10.1371/journal.pone.0052620.s002. The authors indicate that lymphangiogenesis occurs via sprouting from the adjacent mesonephros. We hypothesize that the more internal lymphatics develop between birth and 10 DPN, which corresponds to the time when we observed Lyve1-positive LEPCs.

      4) Immunofluorescence staining of VEGF factors was not convincing. As VEGF is a secreted factor, would it not be detected more in the interstitium? I am always skeptical about the results of immunostaining for secreted growth factors. Would it be possible to perform in situ hybridization or RNAscope to confirm the spatial expression pattern of VEGFs?

      Active VEGF factors result from alternative mRNA splicing events and post-translational proteolytic cleavage. Therefore, in our opinion, the study of VEGF mRNA by in situ hybridization or RNAscope analysis will not be very informative about the actual presence of active forms of VEGF in the epididymis. If necessary, we can provide as supplementary material immunohistochemistry data showing the presence of VEGF-A in the epididymal principal cells. Our major objective with these data was to show that VEGF factors and their respective receptors were present in the epididymis. Nevertheless, in an attempt to convince the reviewer, we provide as accompanying data to this rebuttal letter new sets of figures (Figures VEGF-A-response editor & VEGF-C/VEGF-D-response editor) that we believe can improve the perception of our data. If the editorial office feels it is necessary, these figures could be added to the supplementary figure set (as Figure 6-figure supplement 1 and Figure 6-figure supplement 2). For VEGF-A, the data already exist in the literature, as we have indicated (Korpelainen, 1998). Ultimately, our goal was not to show which cell types of the epididymal epithelium produce VEGFs, but rather that VEGF factors and their receptors were present to support angiogenic or lymphangiogenic activity in the tissue. In addition, because septa have been reported to constitute barriers between segments that restrict the passive diffusion of molecules (Turner et al., 2003; Stammler et al., 2015), we hypothesize that the VEGF factors are produced locally.

      Figure VEGF-A - response editor: Immunofluorescence of the angiogenic ligand VEGF-A in the epididymis. Figure 6 shows that this ligand is mainly found in the caput, and more precisely in S1. It is very strongly expressed in the peritubular microvasculature of the SI, which expresses the VEGFR3:YFP transgene, whereas it is less expressed by intertubular blood vessels (asterisk). This seems to indicate that the peritubular vessels are mainly responsible for the angiogenic activity measured in our study. Furthermore, it is expressed by the epithelium in secretory vesicles (IS, S3 and enlargement), which is in agreement with in situ hybridization work performed by Korpelainen et al. (J. Cell Biol. 1998). The enlargement shown in S3_Z shows the sagittal plane of the tubule, where one can distinguish VEGFR3:YFP-positive cells that are also strongly VEGF-A-positive, indicating that the same epithelial cells express both the receptor and the ligand. Here, the transgene is detected directly, without an anti-GFP antibody to enhance the signal.

      Figure VEGF-C / VEGF-D - response editor: Immunofluorescence of the lymphangiogenic ligands VEGF-C and VEGF-D in the epididymis. This figure shows that these ligands are mainly found in the interstitial tissue throughout the organ, with a higher proportion in the caudal part. This expression may be largely driven by fibroblasts, which are abundant in the interstitium, or by endothelial cells, since both cell types express these two ligands. However, as shown in the figures and in the enlargement of panel A, VEGF-C is also produced by epithelial cells, within what may be secretory vesicles. In contrast, for VEGF-D we observe only a few weakly positive epithelial cells (panel B). These ligands are also detected in the lumen of the epididymal tubules (visible for VEGF-C in panel A, S2). Their presence there may be explained by lumicrine transfer from the testis in addition to secretion by epithelial cells. Here, the transgene is detected directly, without signal amplification by an anti-GFP antibody.

      5) The study is descriptive and does not provide functional or mechanistic insights. Perhaps combining 3D imaging with lineage tracing of endothelial cells, or with ligation studies (removal/ligation of certain vessels), would help to better understand how the vascular network is established and what its functional significance is.

      The technical approaches suggested by the reviewer could certainly improve our understanding of the rather complex epididymal vascular network. Taken together, they represent the body of a comprehensive follow-up study that is worth undertaking.

      6) The immune response is only one of many physiological processes in which vascular networks play significant roles. Other physiological processes, such as tissue metabolism and the stem/progenitor cell niche microenvironment, should also be discussed.

      We agree with the reviewer that the mammalian vasculature is involved in physiological processes beyond immune/inflammatory responses. We have deliberately chosen to focus our discussion on the inflammatory and immune context of the epididymis, as we believe this is the most relevant aspect. It is also fully in line with the research our team has conducted for 15 years to understand the complex orchestration of tolerance versus immune surveillance in this territory. This is a finely tuned process that, if properly understood, can help to understand and appropriately treat clinical situations of infertility and/or urological problems. As our discussion section is already quite long, we feel that extending it to other aspects is not justified. However, in response to the reviewer's suggestion, we now mention at the end of the first paragraph of the discussion that the epididymal vascular network is likely to serve different processes in this tissue (page 9, lines 299 to 303).

      7) How could the authors determine that the Cd-A-labeled vessel in Fig 1 was an artery and not a vein? This leads to another critical question: would it be possible to stain with artery and vein markers to help illustrate the direction of blood flow in the vessel?

      The reviewer is right that we arbitrarily called the Cd-A vessel in Figure 1 an artery. We no longer use the acronym Cd-A. Instead, we use the acronym SEA (superior epididymal artery) to indicate what we firmly believe to be an artery, as also suggested by previous literature (e.g., Suzuki, 1982; Abe et al., 1982) in which this same structure has been consistently referred to as an artery. For other blood vessels, we now use the acronym "Cd-BV" because, as the reviewer rightfully pointed out, we do not know whether we are dealing with a vein or an artery. This is clearly stated in the legend of Figure 1.

    1. Author Response:

      Reviewer #1:

      The manuscript “A computationally designed fluorescent biosensor for D-serine” by Vongsouthi et al. reports the engineering of a fluorescent biosensor for D-serine using the D-alanine-specific solute-binding protein from Salmonella enterica (DalS) as a template. The authors engineer a DalS construct with the enhanced cyan fluorescent protein (ECFP) and the Venus fluorescent protein (Venus) as terminal fusions, which serve as donor and acceptor fluorophores in Förster resonance energy transfer (FRET) experiments. The reporters should monitor a conformational change induced by solute binding through a change in the FRET signal. The authors combine homology-guided rational protein engineering, in-silico ligand docking and computationally guided, stabilizing mutagenesis to transform DalS into a D-serine-specific biosensor through iterative mutagenesis experiments. The functionality and solute affinity of modified DalS are probed using FRET assays. Vongsouthi et al. assess the applicability of the final D-serine-selective biosensor (D-SerFS) in-situ and in-vivo using fluorescence microscopy.

      Ionotropic glutamate receptors are ligand-gated ion channels that play important roles in brain development, learning, memory and disease. D-serine is a co-agonist of ionotropic glutamate receptors of the NMDA subtype. The modulation of NMDA signalling in the central nervous system by D-serine is poorly understood. Optical biosensors that can detect D-serine are lacking, and the development of such sensors, as proposed in the present study, is an important goal in biomedical research.

      The manuscript is well written and the data are clearly presented and discussed. The authors appear to have succeeded in developing a D-serine-selective fluorescent biosensor. However, some questions arose concerning the experimental design. Moreover, not all conclusions are fully supported by the data presented. I have the following comments.

      1) In the homology-guided design, two residues in the binding site were mutated to those of the D-serine-specific homologue NR1 (i.e. F117L and A147S), which led to a significant increase in affinity for D-serine, as desired. The third residue, however, was mutated to glutamine (Y148Q) instead of the homologous valine (V), which resulted in a substantial loss of affinity for D-serine (Table 1). This "bad" mutation was carried through in consecutive optimization steps. Did the authors also try the homologous Y148V mutation? On page 5 the authors argue that Q instead of V would increase the size of the side chain pocket. But the opposite is true: the side chain of Q is bulkier than that of V, which may explain the dramatic loss of affinity for D-serine. Mutation Y148V may be beneficial.

      Yes, we have previously tested the mutation of position 148 to valine (V). We have now included these data in the paper as Supplementary Information Figure 1 (below). The fluorescence titration showed that the 148V variant displayed poor D-serine specificity compared with Q at the same position (the sequence background of the variant was F117L/A147S/D216E/A76D). Thus, Q was superior to V at this position, and V was not taken forward for further engineering. In the text, we meant that Q would increase the size of the side chain pocket relative to the wild-type amino acid, Y. We can see that this is unclear and have updated this sentence.

      Supplementary Figure 1. Dose-response curves for F117L/A147S/Y148V/D216E/A76D (LSVED) with glycine, D-alanine and D-serine. Values are the (475 nm/530 nm) fluorescence ratio as a percentage of the same ratio for the apo sensor. No significant change is detected in response to glycine. The KD values for D-alanine and D-serine are estimated to be > 4000 mM based on fitting curves with the following equation:

      2) Stabilities of constructs were estimated from melting temperatures (Tm) measured using thermal denaturation probed using the FRET signal of ECFP/Venus fusions. I am not sure if this methodology is appropriate to determine thermal stabilities of DalS and mutants thereof. Thermal unfolding of the fluorescence labels ECFP and Venus and their intrinsic, supposedly strongly temperature-dependent fluorescence emission intensities will interfere. A deconvolution of signals will be difficult. It would be helpful to see raw data from these measurements. All stabilities are reported in terms of deltaTm. What is the absolute Tm of the reference protein DalS? How does the thermal stability of DalS compare to thermal stabilities of ECFP and Venus? A more reliable probe for thermal stability would be the far-UV circular dichroism (CD) spectroscopic signal of DalS without fusions. DalS is a largely helical domain and will show a strong CD signal.

      We agree that raw data for the thermal denaturation experiments should be shown and have included them in the supporting information of the updated manuscript (Supplementary Data Figure 7). The data plot the ECFP/Venus fluorescence ratio against temperature. When the temperature is increased from 20 to 90 °C, we observe two transitions in the ECFP/Venus fluorescence ratio. The fluorescent proteins are more thermostable than the DalS binding protein, and their transition does not vary (~90 °C). Thus, the first transition corresponds to the unfolding of the binding protein and the second to the unfolding, or loss of fluorescence, of the fluorescent proteins. This is an appropriate method for characterising the thermostability of the binding protein in the sensor for two main reasons. First, the melting temperature calculated from the first sigmoidal transition changes upon mutation of the binding protein in a predictable way (e.g. mutations to the binding site/protein core are destabilising), while the second transition occurs consistently at ~90 °C. This supports the assignment of the first transition to the unfolding of the binding protein. Second, characterising the stability of the binding protein in the context of the full sensor is more relevant to the end application. Excising the binding domain and testing it in isolation would result in data that are not directly relevant to the sensor. The absolute thermostabilities of all variants can be found in Table 1 of the manuscript.

      Supplementary Figure 7. The (475 nm/530 nm) fluorescence ratio as a function of increasing temperature (20 – 90 °C) for key variants in the engineering trajectory of D-serFS. Values are normalised as a percentage of the same ratio for the sensor at 20 °C and are represented as mean ± s.e.m. (n = 3). The first sigmoidal transition in the data changes upon mutation to the binding protein while the second transition begins at ~ 90 °C for all variants. The second transition is not observed in full as the upper temperature limit for the experiment is 90 °C.
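      As a rough illustration of how a melting temperature can be read from the first transition of such a ratio-versus-temperature curve, the sketch below locates the midpoint of that transition by linear interpolation. It is a model-free simplification, not the sigmoidal fit behind Table 1. The function name, the 80 °C cutoff used to exclude the ~90 °C fluorescent-protein transition, and the synthetic data are assumptions for illustration.

```python
import math


def midpoint_temperature(temps, signal, t_max=80.0):
    """Estimate the midpoint (Tm) of the first unfolding transition.

    Only points with T <= t_max are used (an assumed cutoff that excludes the
    fluorescent-protein transition near 90 °C). The midpoint is where the signal
    crosses halfway between its first and last values in that window, found by
    linear interpolation between neighbouring points.
    """
    pts = [(t, s) for t, s in zip(temps, signal) if t <= t_max]
    if len(pts) < 2:
        raise ValueError("need at least two points below t_max")
    half = (pts[0][1] + pts[-1][1]) / 2.0
    for (t0, s0), (t1, s1) in zip(pts, pts[1:]):
        if (s0 - half) * (s1 - half) <= 0 and s0 != s1:
            return t0 + (half - s0) * (t1 - t0) / (s1 - s0)
    raise ValueError("no crossing of the half-way level found")


# Hypothetical curve: first transition centred at 55 °C, plus a step from
# 88 °C standing in for the fluorescent-protein transition.
temps = list(range(20, 91))
ratio = [100 + 30 / (1 + math.exp((55 - t) / 2)) + (50 if t >= 88 else 0) for t in temps]
tm = midpoint_temperature(temps, ratio)
```

      A fitted two-state sigmoid would use every point and give an uncertainty estimate. The interpolation above only shows why restricting the analysis to the first transition isolates the binding-protein unfolding from the later fluorescent-protein signal.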

      3) The final construct D-SerFS has a dynamic range of only 7%, which is a low value. It seems that the FRET signal change caused by ligand binding to the construct is weak. Is it sufficient to reliably measure D-serine levels in-situ and in-vivo?

      First, we have modified the sensor, which now has a dynamic range of 14.7% (Figure 5, below). The magnitude of the change is reasonable for this sensor class. These sensors function with a relatively low dynamic range because they are ratiometric, i.e. their ratiometric readout keeps them accurate even with a low dynamic range. For example, the glycine sensor GlyFS published in 2018 (Nature Chem. Biol.) has one of the highest dynamic ranges in this sensor class, at only ~28%. The glutamate sensor described by Okumoto et al. (2005) (PNAS, 102, 8740) has a dynamic range of ~9%. So this FRET change is not low for ratiometric sensors of this class (which have been used very effectively for over a decade). Most importantly, the data from experiments with biological tissue and in vivo (Fig. 6) demonstrate a detectable (and statistically significant) response to changes in D-serine concentration in tissue.

      Figure 5. Characterization of full-length D-serFS. (A) Schematic showing the ECFP (blue), D-serFS binding protein (D-serFS BP; grey) and Venus (yellow) domains in D-serFS. The C-terminal residues of the Venus fluorescent protein sequence are labelled, showing the truncated (top) and full-length (bottom) C-terminal sequences. The underlined amino acids in truncated D-serFS represent residues introduced from the backbone vector sequence during cloning. Represents the STOP codon. (B) Sigmoidal dose response curves for truncated and full-length D-serFS with D-serine (n = 3). Values are the (475 nm/530 nm) fluorescence ratio as a percentage of the same ratio for the apo sensor. (C) Binding affinities (M) determined by fluorescence titration of truncated and full-length D-serFS, for glycine, D-alanine and D-serine (n = 3).*
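      The dose-response fits and the dynamic-range figure discussed here can be illustrated with a small fitting sketch. The fitting equation used by the authors is not reproduced in this section, so the model below is an assumption. It uses a single-site binding isotherm, ratio = R_apo + ΔR·[L]/(KD + [L]), which looks sigmoidal when concentration is plotted on a log axis. The ratio is expressed as a percentage of the apo value, and the dynamic range is taken as 100·ΔR/R_apo. For a fixed KD the model is linear in R_apo and ΔR, so the sketch solves those two parameters by ordinary least squares and scans KD on a log grid. The function names and synthetic data (KD = 7 µM, 14.7% range) are hypothetical.

```python
import math


def bound_fraction(ligand, kd):
    """Fraction of sensor occupied under an (assumed) single-site binding isotherm."""
    return ligand / (kd + ligand)


def fit_single_site(conc, ratio, kd_min=1e-2, kd_max=1e4, n_grid=601):
    """Fit ratio = r_apo + delta * L / (KD + L) by a log-spaced grid search over KD.

    For each candidate KD, r_apo and delta follow from simple linear regression
    of the ratio on the bound fraction. Returns (kd, r_apo, delta, sse).
    """
    best = None
    lo, hi = math.log10(kd_min), math.log10(kd_max)
    for k in range(n_grid):
        kd = 10 ** (lo + (hi - lo) * k / (n_grid - 1))
        f = [bound_fraction(c, kd) for c in conc]
        n = len(f)
        f_mean, y_mean = sum(f) / n, sum(ratio) / n
        sxx = sum((fi - f_mean) ** 2 for fi in f)
        if sxx == 0:
            continue  # all bound fractions identical: slope undefined
        sxy = sum((fi - f_mean) * (yi - y_mean) for fi, yi in zip(f, ratio))
        delta = sxy / sxx
        r_apo = y_mean - delta * f_mean
        sse = sum((yi - (r_apo + delta * fi)) ** 2 for fi, yi in zip(f, ratio))
        if best is None or sse < best[3]:
            best = (kd, r_apo, delta, sse)
    return best


def dynamic_range_percent(r_apo, delta):
    """Maximal ratio change relative to the apo ratio, in percent."""
    return 100.0 * delta / r_apo


# Hypothetical titration (concentrations in µM), noise-free for clarity.
conc = [0, 0.5, 1, 2, 5, 10, 20, 50, 100, 500]
ratio = [100 + 14.7 * c / (7 + c) for c in conc]
kd, r_apo, delta, sse = fit_single_site(conc, ratio)
dr = dynamic_range_percent(r_apo, delta)
```

      The grid resolution (0.01 log units here) limits how precisely KD is recovered. A nonlinear least-squares routine would refine it, but the grid keeps the sketch dependency-free.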

      In Figure 5H, the in-vivo signal changes show large errors, and the signal of the positive sample is hardly above error compared with the signal of the control.

      We have removed the in vivo data. Regardless, the comment is incorrect. Statistical analysis confirms that there is no significant change in the control (P = 0.08411), whereas the change for the sample with D-serine was significant (P = 0.00998).

      “H) ECFP/Venus ratio recorded in vivo in control recordings (left panel, baseline recording first, control recording after 10 minutes; paired two-sided Student’s t-test vs. baseline, t(6) = -2.07, P = 0.08411; n = 6 independent experiments) and during D-serine application (right panel, baseline recording first, second recording after D-serine injection, 1 mM; paired two-sided Student’s t-test vs. baseline, t(3) = -5.85, P = 0.00998; n = 4 independent experiments). Values are mean ± s.e.m. throughout. **P < 0.01.”
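      For readers checking the legend, a paired t statistic is the mean of the within-experiment differences divided by its standard error. Its degrees of freedom equal the number of pairs minus one. The sketch below computes only the statistic and degrees of freedom; the P-value additionally requires the t distribution, e.g. from a statistics package. Differences are taken as baseline minus recording, so a rise after treatment gives a negative t. The function name and numbers are hypothetical.

```python
import math
import statistics


def paired_t(baseline, recording):
    """Paired t statistic and degrees of freedom for matched measurements.

    Differences are baseline - recording, so an increase after treatment
    yields a negative t.
    """
    if len(baseline) != len(recording) or len(baseline) < 2:
        raise ValueError("need at least two matched pairs")
    diffs = [b - r for b, r in zip(baseline, recording)]
    n = len(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)  # standard error of the mean difference
    return statistics.mean(diffs) / se, n - 1


# Hypothetical ECFP/Venus ratios from four paired experiments.
t, df = paired_t([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 5.0, 5.0])
```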

      Figure 5G is unclear. What does the fluorescence image show?

      We have removed the in-vivo data from the manuscript. However, Figure 6 in the original manuscript shows a schematic of how the sensor is applied to the brain for in-vivo experiments (biotin injection, followed by sensor injection and then imaging). The fluorescence image shows the detected Venus fluorescence following pressure loading of the sensor into the brain.

      Work presented in this manuscript that assesses functionality and applicability of the developed sensor in-situ and in-vivo is limited compared to the work showing its design. For example, control experiments showing FRET signal changes of the wild-type ECFP-DalS-Venus construct in comparison to the designed D-SerFS would be helpful to assess the outcome.

      Indeed, the in situ and in vivo work was never the focus of the study, which is already a large paper. To avoid confusion, the in vivo work is now omitted, and the in situ work is presented as a proof of principle that the sensor can be used to image D-serine. We reiterate – this is a protein engineering paper, not a neuroscience paper.

      4) The FRET spectra shown in Supplementary Figure 2, which exemplify the measurement of fluorescence ratios of ECFP/Venus, are confusing. I cannot see a significant change of FRET upon application of ligand. The ratios of the peak fluorescence intensities of ECFP and Venus (scanned from the data shown in Supplementary Figure 2) are the same for apo states and the ligand-saturated states. Instead what happens is that fluorescence emission intensities of both the donor and the acceptor bands are reduced upon application of ligand.

      We thank the reviewer for bringing this to our attention. The spectra were not normalised to account for the effect of dilution when saturating with ligand, giving rise to an observed decrease in emission intensity from both ECFP and Venus. We can also see how the figure is hard to interpret when both variants are displayed on the same axes, so we have separated them in an updated figure shown below and normalised the data as a percentage of the maximum emission intensity from ECFP at 475 nm. This has been changed in the supporting information of an updated manuscript. Hopefully it is now clear that there is a ratiometric change upon addition of ligand.

      Figure 3. Emission spectra (450 – 550 nm) of (A) LSQED and (B) LSQED-T197Y (LSQEDY) upon excitation of ECFP (λexc = 433 nm), normalised to the maximum emission intensity from ECFP (475 nm). For all sensor variants, the FRET efficiency decreases in response to saturation with D-serine (A, B; orange), leading to decreased emission from Venus (530 nm) relative to ECFP (475 nm). When comparing the apo states of LSQED and LSQEDY (A, B; dark green), it can be seen that the T197Y mutation results in a decreased Venus emission (lower FRET efficiency). This suggests a shift in the apo population of the sensor towards the spectral properties of the saturated, closed state and explains the decreased dynamic range of LSQEDY compared with LSQED. Values are mean ± s.e.m. (n = 3).
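      The dilution argument in this reply can be checked with a toy calculation. Scaling every intensity by the same factor lowers both the ECFP and Venus peaks, but it leaves their ratio unchanged, and it also leaves the spectrum normalised to the ECFP peak unchanged. The intensities, the intermediate 500 nm point and the 0.9 dilution factor below are hypothetical.

```python
import math


def ecfp_venus_ratio(spectrum):
    """ECFP (475 nm) / Venus (530 nm) emission ratio."""
    return spectrum[475] / spectrum[530]


def normalise_to_ecfp_peak(spectrum):
    """Express each intensity as a percentage of the ECFP emission at 475 nm."""
    peak = spectrum[475]
    return {wl: 100.0 * value / peak for wl, value in spectrum.items()}


def dilute(spectrum, factor):
    """Scale every intensity uniformly, as diluting the sensor would."""
    return {wl: value * factor for wl, value in spectrum.items()}


# Hypothetical apo spectrum (wavelength in nm -> intensity, arbitrary units).
apo = {475: 1000.0, 500: 900.0, 530: 1200.0}
diluted = dilute(apo, 0.9)
ratio_unchanged = math.isclose(ecfp_venus_ratio(apo), ecfp_venus_ratio(diluted))
venus_percent = normalise_to_ecfp_peak(diluted)[530]
```

      A genuine FRET change, by contrast, alters the Venus-to-ECFP balance itself. Only such a change moves the ratio or the normalised 530 nm value.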

      Reviewer #2:

      The authors describe the development and use of a D-serine sensor based on a periplasmic ligand-binding protein (DalS) from Salmonella enterica in conjunction with a FRET readout between enhanced cyan fluorescent protein and Venus fluorescent protein. They rationally identify point mutations in the binding pocket that make the binding protein somewhat more selective for D-serine over glycine and D-alanine. Ligand docking into the binding site, as well as algorithms for increasing stability, identified further mutants with higher thermostability and higher affinity for D-serine. The combined computational efforts led to a sensor with higher affinity for D-serine (Kd = ~7 µM), which, however, also showed affinity for the native ligand D-alanine (Kd = ~13 µM) and for glycine (Kd = ~40 µM). Molecular simulations were then used to explain how remote mutations identified in the thermostability screen could lead to the observed alteration of ligand affinity. Finally, D-SerFS was tested by 2P imaging in hippocampal slices and in anesthetized mice, using biotin-streptavidin to anchor exogenously applied purified sensor protein to the brain tissue and pipetting on saturating concentrations of D-serine.

      Although presented as the development of a sensor for biology, this work primarily focuses on the application of existing protein engineering techniques to alter the ligand affinity and specificity of a ligand-binding protein domain. The authors are somewhat successful in improving specificity for the desired ligand, but much context is lacking. For any such engineering effort, the end goals should be laid out as explicitly as possible. What sorts of biological signals do they desire to measure? On what length scale? On what time scale? What is known about the concentrations of the analyte and potential competing factors in the tissue? Since the authors do not demonstrate the imaging of any physiological signals with their sensor and do not discuss in detail the nature of the signals they aim to see, the reader is unable to evaluate what effect (if any) all of their protein engineering work had on their progress toward the goal of imaging D-serine signals in tissue.

      As a paper describing a combination of protein engineering approaches to alter the ligand affinity and specificity of one protein, it is a relatively complete work. In its current form trying to present a new fluorescent biosensor for imaging biology it is strongly lacking. I would suggest the authors rework the story to exclusively focus on the protein engineering or continue to work on the sensor/imaging/etc until they are able to use it to image some biology.

      Additional Major Points:

      1) There is no discussion of why the authors chose to use non-specific chemical labeling of the tissue with NHS-biotin to anchor their sensor vs. genetic techniques to get cell-type specific expression and localization. There is no high-resolution imaging demonstrating that the sensor is localized where they intended.

      We use non-specific chemical labelling for proof-of-concept experiments that show the sensor can respond to changes in D-serine concentration in the extracellular environment of brain tissue. Cell-type specific expression of the sensor is possible based on our previous development of a similar sensor for glycine (Zhang et al., 2018; doi: https://doi.org/10.1038/s41589-018-0108-2) where the sensor was expressed by HEK293 cells and neurons, and targeted to the membrane. However, this is beyond the scope of this manuscript. Figure 5G of the original manuscript shows that the sensor (identified by Venus fluorescence) is localized to the area where D-serFS is pressure-loaded into the brain.

      2) Why does the fluorescence of both the CFP and the YFP decrease upon addition of ligand (see e.g. Supplementary Figure 2)? Were these samples at the same concentration? Is this really a FRET sensor or more of an intensiometric sensor? Is this also true with 2P excitation? How does the Venus fluorescence change when Venus is excited directly? Perhaps fluorescence lifetime measurements could help reveal what is happening.

      Please see response to major comments from reviewer #1 and Figure 3. We hope this clarifies that the sensor is ratiometric. The sensor behaves similarly under two-photon excitation (2PE) as shown in Figure 5A.

      3) How reproducible are the spectral differences between LSQED and LSQED-T197Y? Only one trace for each is shown in Supplementary Figure 2 and the differences are very small, but the authors use these data to draw conclusions about the protein open-closed equilibrium.

      We have updated this to show data points representing the mean ± s.e.m (n = 3).

      4) The first three mutations described are arrived upon by aligning DalS (which is more specific for D-Ala) with the NMDA receptor (which binds D-Ser). The authors then mutate two of the ligand pocket positions of DalS to the same amino acid found in NMDAR, but mutate the third position to glutamine instead of valine. I really can't understand why they don't even test Y148V if their goal is a sensor that hopefully detects D-Ser similar to the native NMDAR. I'm sure most readers will have the same confusion.

      Please see response to major comments from reviewer #1. Additionally, while the NR1 binding domain of the NMDAR was used as a structural guide for rational design of the DalS binding site, the high affinity of the NMDAR for both D-serine and glycine was not desirable in a D-serine-specific sensor.

    1. Author Response

      Reviewer #1 (Public Review):

      The authors ask an interesting question as to whether working memory contains more than one conjunctive representation of the multiple task features required for a future response, with one of these representations being more likely to become relevant at the time of the response. With RSA, the authors use a multivariate approach that seems to be becoming the standard in modern EEG research.

      We appreciate the reviewer’s helpful comments on the manuscript and their encouraging comments regarding its potential impact.

      I have three major concerns that are currently limiting the meaningfulness of the manuscript: For one, the paradigm uses stimuli with properties that could potentially influence involuntary attention and interfere in a Stroop-like manner with the required responses (i.e., 2 out of 3 cues involve the terms "horizontal" or "vertical" while the stimuli contain horizontal and vertical bars). It is not clear to me whether these potential interactions might bring about what is identified as conjunctive representations or whether they cause these representations to be quite weak.

      We agree it is important to rule out any effects of involuntary attention that might have been elicited by our stimulus choices. To address the Reviewer’s concern, we conducted control analyses to test whether Stroop-like interference influenced our measures of behavior or the conjunctive representation. To summarize these analyses (detailed in our responses below and in the supplemental materials), we found no evidence of an effect of compatibility on behavior or on the decoding of conjunctions during either the maintenance or test periods. Furthermore, we found that the decoding of the bar orientation was at chance level during the interval when we observe evidence of the conjunctive representations. Thus, we conclude that the compatibility of the stimuli and the rule did not contribute to the decoding of conjunctive representations or to behavior.

      Second, the relatively weak conjunctive representations are making it difficult to interpret null effects such as the absence of certain correlations.

      The reviewer is correct that we cannot draw strong conclusions from null findings. We have revised the main text accordingly. In certain cases, we have also included additional analyses. These revisions are described in detail in response to the reviewer’s comments below.

      Third, if the conjunctive representations truly are reflections of working memory activity, then it would help to include a control condition where memory load is reduced so as to demonstrate that representational strength varies as a function of load. Depending on whether these concerns or some of them can be addressed or ruled out this manuscript has the potential of becoming influential in the field.

      This is a clever suggestion for further experimentation. We agree that observing an adverse effect of memory load is one robust way to assess the contribution of the working memory system in future studies. However, decoding is noisy during the maintenance period even with a relatively low set size, particularly for the low-priority conjunctive representation. We therefore expect that further manipulating load would require altering the research design substantially. Because the main goal of the current study is to examine prioritization and post-encoding selection of action-related information, we focused on the minimum set size required for this question (i.e., load 2). However, we now note this load manipulation as a direction for future research in the discussion (pg. 18).

      Reviewer #2 (Public Review):

      Kikumoto and colleagues investigate the way visual-motor representations are stored in working memory and selected for action based on a retro-cue. They make use of a combination of decoding and RSA to assess at which stages of processing sensory, motor, and conjunctive information (consisting of sensory and motor representations linked via an S-R mapping) are represented in working memory and how these mental representations are related to behavioral performance.

      Strengths

      This is an elaborate and carefully designed experiment. The authors are able to shed further light on the type of mental representations in working memory that serve as the basis for the selection of relevant information in support of goal-directed actions. This is highly relevant for a better understanding of the role of selective attention and prospective motor representations in working memory. The methods used could provide a good basis for further research in this regard.

      We appreciate these helpful comments and the Reviewer’s positive comments on the impact of the work.

      Weaknesses

      There are important points requiring further clarification, especially regarding the statistical approach and interpretation of results.

      • Why is a conjunction RSA model vector (b4) required when all information for a response can be obtained by combining the individual stimulus, response, and rule vectors? In Figure 3 it becomes obvious that the conjunction RSA scores do not simply reflect the overlap of the other three vectors. I think it would help the interpretation of results to clearly state why this is not the case.

      Thank you for the suggestion; we have now added the theoretical background that motivated us to include the RSA model of conjunctive representation (pg. 4 and 5). In particular, several theories of cognitive control have proposed that over the course of action planning, the system assembles an event (task) file which binds all task features at all levels – including the rule (i.e., context), stimulus, and response – into an integrated, conjunctive representation that is essential for an action to be executed (Hommel 2019; Frings et al. 2020). Similarly, neural evidence from non-human primates suggests that cognitive tasks that require context dependency (e.g., flexible remapping of inputs to different outputs based on the context) recruit nonlinear conjunctive representations (Rigotti et al. 2013; Parthasarathy et al. 2019; Bernardi et al. 2020; Panichello and Buschman, 2021). Supporting these views, we previously observed that conjunctive representations emerge in the human brain during action selection. These representations uniquely explained behavior, such as the costs of action transitions (Kikumoto & Mayr, 2020; see also Rangel, Hazeltine, & Wessel, 2022) or the successful cancellation of actions (Kikumoto & Mayr, 2022). In the current study, by using the same set of RSA models, we attempted to extend the role of conjunctive representations to the planning and prioritization of future actions. As in the previous studies (and as noted by the reviewer), the conjunction model makes a unique prediction about the similarity (or dissimilarity) pattern of the decoder outputs: each specific instance of action is distinct from all other actions. This contrasts with the RSA models of low-level features, which predict similar patterns of activity for instances that share the same feature (e.g., S-R mappings 1 to 4 share the diagonal rule context). Here, we generally replicate the previous studies, showing the unique trajectories of conjunctive representations (Figure 3) and their unique contribution to behavior (Figure 5).

      • One of the key findings of this study is the reliable representation of the conjunction information during the preparation phase while there is no comparable effect evident for response representations. This might suggest that two potentially independent conjunctive representations can be activated in working memory and thereby function as the basis for later response selection during the test phase. However, the assumption of the independence of the high and low priority conjunction representations relies only on the observation that there was no statistically reliable correlation between the high and low priority conjunctions in the preparation and test phases. This assumption is not valid because non-significant correlations do not allow any conclusion about the independence of the two processes. A comparable problem appeared regarding the non-significant difference between high and low-priority representations. These results show that it was not possible to prove a difference between these representations prior to the test phase based on the current approach, but they do not unequivocally "suggest that neither action plan was selectively prioritized".

      We appreciate this important point. We have taken care in the revision to state that we find evidence of an interference effect for the high-priority action and do not find evidence for such an effect for the low-priority action. Thus, we do not intend to conclude that no such effect could exist. Further, although it is not our intention to draw a strong conclusion from the null effect (i.e., no correlations), we performed an exploratory analysis in which we tested the correlation in trials where we observed strong evidence of both conjunctions. Specifically, we split trials at the median within each time point and individual subject, and performed the multilevel model analysis using trials where both the high- and low-priority conjunctions were above their medians. Thus, we selected trials in such a way that they are independent of the effect we are testing. The figure below shows the coefficient associated with the low-priority conjunction predicting the high-priority conjunction (uncorrected). Even when we focus on trials where both conjunctions are detected (i.e., a high signal-to-noise ratio), we observed no tradeoff. Again, we cannot draw strong conclusions from the null result of this exploratory analysis. Yet, we can rule out some causes of the absence of correlation between high- and low-priority conjunctions, such as a poor signal-to-noise ratio of the low-priority conjunctions. We have further clarified this point in the Results (pg. 14).

      Fig. 1. Trial-to-trial variability between high- and low-priority conjunctions, using above-median trials. Coefficients of the multilevel regression model predicting trial-to-trial variability in the high-priority conjunction from the low-priority conjunction.

      • The experimental design used does not allow for a clear statement about whether pure motor representations in working memory only emerge with the definition of the response to be executed (test phase). It is not evident from Figure 3 that the increase in the RSA scores strictly follows the onset of the Go stimulus. It is also conceivable that the emergence of a pure motor representation requires a longer processing time. This could only be investigated through temporally varying preparation phases.

      We agree with the reviewer. Although we detected no evidence of response representations for either the high- or the low-priority action plan during the preparation phase, t(1,23) = -.514, beta = .002, 95% CI [-.010 .006] for high priority; t(1,23) = -1.57, beta = -.008, 95% CI [-.017 .002] for low priority, this may be limited by the relatively short duration of the delay period (750 ms) in this study. However, in our previous studies using a similar paradigm without a delay period (Kikumoto & Mayr, 2020; Kikumoto & Mayr, 2022), response representations were detected less than 300 ms after the response was specified, which corresponds to the onset of the delay period in this study. Further, participants in the current study were encouraged to prepare responses as early as possible, using adaptive response deadlines and performance-based incentives. Thus, we know of no reason why responses would take longer to prepare in the present study. But we agree that we cannot rule this out. We have added the caveat noted above, as well as this additional context, to the Discussion (pg. 16-17).

      • Inconsistency of statistical approaches: In the methods section, the authors state that they used a cluster-forming threshold and a cluster-significance threshold of p < 0.05. In the results section (Figure 4) a cluster p-value of 0.01 is introduced. Although this concerns different analyses, varying threshold values appear as if they were chosen in favor of significant results. The authors should either proceed consistently here or give very good reasons for varying thresholds.

      We thank the reviewer for noting this oversight. All significant clusters reported with a cluster p-value were identified using a cluster-forming threshold of p < .05. We have corrected the description accordingly.
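      To make the distinction between the two thresholds concrete, the sketch below shows the cluster-forming step of a generic cluster-based permutation test and how a cluster p-value is then obtained. It is an illustration under stated assumptions, not the authors' pipeline: it assumes per-time-point t and p values from a group-level test, and the function names are hypothetical. Contiguous time points with p below the cluster-forming threshold and the same sign of t form a cluster; the cluster's mass (summed t) is compared against a permutation distribution of maximum cluster masses, and that comparison yields the cluster p-value to which the cluster-significance threshold is applied.

```python
import numpy as np

def form_clusters(t_vals, p_vals, alpha=0.05):
    """Group contiguous, same-sign time points with p < alpha into clusters.

    alpha is the cluster-forming threshold. Returns (start, stop, mass)
    tuples, with stop exclusive and mass = sum of t-values in the cluster.
    """
    t_vals = np.asarray(t_vals, dtype=float)
    # +1 / -1 for supra-threshold points (by sign of t), 0 otherwise.
    labels = np.where(np.asarray(p_vals) < alpha, np.sign(t_vals), 0.0)
    clusters, start = [], None
    for i in range(len(labels) + 1):
        current = labels[i] if i < len(labels) else 0.0
        # Close the open cluster when the label changes (or the series ends).
        if start is not None and current != labels[start]:
            clusters.append((start, i, float(t_vals[start:i].sum())))
            start = None
        # Open a new cluster at any supra-threshold point.
        if start is None and current != 0:
            start = i
    return clusters

def cluster_p_value(observed_mass, null_max_masses):
    """Share of permutation max-masses at least as large as |observed_mass|.

    null_max_masses holds the largest |cluster mass| from each permutation;
    the +1 terms count the observed data as one of the permutations.
    """
    null = np.abs(np.asarray(null_max_masses, dtype=float))
    return float((np.sum(null >= abs(observed_mass)) + 1) / (len(null) + 1))

# Toy values: one positive and one negative supra-threshold run.
t = [0.5, 2.5, 3.0, -0.2, -2.8, -2.6, 0.1]
p = [0.6, 0.02, 0.01, 0.8, 0.01, 0.03, 0.9]
clusters = form_clusters(t, p)
print(clusters)  # clusters at time points 1-2 and 4-5
```

      In practice the null masses come from re-running the same test on sign-flipped or label-shuffled data many times; only the threshold passed as alpha governs cluster formation, while the cluster p-value is judged against a separate significance level.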

      • Interpretation of results: The significant time window for the high vs. low priority by test-type interaction appeared quite late for the conjunction representation. First, it does not seem reasonable that such an effect appears in a time window overlapping with the motor responses. But more importantly, why should it appear after the respective interaction for the response representation? When keeping in mind that these results are based on a combination of time-frequency analysis, decoding, and RSA (quite many processing steps), I find it hard to really see a consistent pattern in these results that allows for a conclusion about how higher-level conjunctive and motor representations are selected in working memory.

      Thank you for raising this important point. First, we fixed the reported methodological inconsistencies (e.g., the cluster p-value and cluster-forming threshold). Further, we fully agree that the difference in the time course of the response and conjunctive representations in the low-priority, tested condition is unexpected and would complicate the perspective that the conjunctive representation contributes to efficient response selection. However, additional analysis indicates that this apparent pattern in the stimulus-locked results is misleading and that there is a more parsimonious explanation. We first wish to caution that the data are relatively noisy and are likely influenced by different frequency bands for different features. Thus, fine-grained temporal differences should be interpreted with caution in the absence of positive statistical evidence of an interaction over time. Indeed, though Figure 4 in the original submission shows a quantitative difference in the timing of the interaction effect (priority by test type) between the conjunctive and response representations, the direct test of this four-way interaction [priority x test type x representation type (conjunction vs. response) x time interval (1500 ms to 1850 ms vs. 1850 to 2100 ms)] is not significant, t(1,23) = 1.65, beta = .058, 95% CI [-.012 .015]. The same analysis using response-aligned data is also not significant, t(1,23) = -1.24, beta = -.046, 95% CI [-.128 .028]. These observations did not depend on the choice of time interval, as other time intervals were also not significant. Therefore, we do not have strong evidence of a true timing difference between these conditions and believe the apparent difference is likely driven by noise.

      Further, we believe the apparent late emergence of a difference between the two conjunctions when the low-priority action is tested is more likely due to a slow decline in the strength of the untested high-priority conjunction than to a late emergence of the low-priority conjunction. This pattern is clearer when the traces are aligned to the response. The low-priority conjunction emerges early; it is sustained when it is the tested action and declines when it is untested (-226 ms to 86 ms relative to response onset, cluster-forming threshold, p < .05). These changes eventually result in a significant difference in strength between the tested and untested low-priority conjunctions just prior to the commission of the response (Figure 4 - figure supplement 1, right-column panel of the middle row, black bars at the top of the panel). Importantly, the high-priority conjunction also remains active in its untested condition and declines later than the untested low-priority conjunction does. Indeed, the untested high-priority conjunction does not decline significantly relative to trials in which it is tested until after the response is emitted (Figure 4 - figure supplement 1, right-column panel of the middle row, red bars at the top of the panel). This results in a late-emerging interaction between priority and test type, but it is not due to a late-emerging low-priority conjunctive representation.

      In summary, we do not have statistical evidence of a time-by-effect interaction that would allow us to draw strong inferences about timing. Nonetheless, even the patterns we observe are inconsistent with a late-emerging low-priority conjunctive representation. If anything, they support a late decline in the untested high-priority conjunctive representation. This pattern, in which the high-priority conjunction is sustained until late even when it is untested, is also notable in light of our observation that the strength of the high-priority conjunctive representation interferes with behavior when the low-priority item is tested, but not vice versa. We now address this point about timing directly in the Results (pg. 15-16) and the Discussion (pg. 21), and we include the response-locked results in the main text alongside the stimulus-locked results, including the exploratory analyses reported here.

      Reviewer #3 (Public Review):

      This study aims to address the important question of whether working memory can hold multiple conjunctive task representations. The authors combined a retro-cue working memory paradigm with their previous task design that cleverly constructed multiple conjunctive tasks with the same set of stimuli, rules, and responses. They used advanced EEG analytical skills to provide the temporal dynamics of concurrent working memory representation of multiple task representations and task features (e.g., stimulus and responses) and how their representation strength changes as a function of priority and task relevance. The results generally support the authors' conclusion that multiple task representations can be simultaneously manipulated in working memory.

      We appreciate these helpful comments, and were pleased that the reviewer shares our view that these results may be broadly impactful.

    1. Author Response

      Reviewer #2 (Public Review):

      Reviewer #2 was critical of every aspect of our manuscript and we were disappointed that they failed to appreciate the significance of our findings. However, we have responded to each point as described below:

      1) The experiment displayed in Figure 5 is deeply flawed for multiple reasons and should be removed from the manuscript entirely. A Michaelis-Menten plot compares the initial rate of a reaction versus substrate concentration. Instead, the authors plotted the fraction of SsrB that is phosphorylated after 10 minutes at various substrate concentrations. Such a plot must reach saturation because the enzyme is limiting, whereas it is not always possible to achieve saturation in a genuine Michaelis-Menten plot. Because no reaction rates were measured, it is not possible to derive kcat values from the data.

      Mea culpa. We now plot our phosphorylation data, describe the midpoint as a k0.5, and have removed Fig. 1g. When we directly compare the H12Q mutant to wt at neutral pH, its phosphorylation level is lower than that of wt (see new Fig. 4a). The wt phosphorylation is reduced at acid pH (Fig 4b), but with H12Q there was no difference in phosphorylation between neutral and acid pH (Fig 4c). It is important to include these data because in RcsB, a close homolog of SsrB, an H12A mutant was not phosphorylated by acetyl phosphate and was incapable of binding to DNA, unlike what we show here with SsrB.

      (i) Increasing the concentration of the phosphoramidate substrate increased ionic strength. Response regulator active sites contain many charged moieties, and autophosphorylation of at least one response regulator (CheY) is inhibited by increasing ionic strength (PMID 10471801).

      The reviewer raises some interesting points, which are based on CheY phosphorylation by small molecules. We have a long history of studying OmpR and SsrB as well as other RRs, and we know that they can all behave very differently from “canonical signaling”. We examined the effect of ionic strength on SsrB phosphorylation and found it to be relatively insensitive to changes in ionic strength (our original buffer was 267-430 mOsm, and in each case we observed 90% phosphorylation). However, we repeated all of the phosphorylation experiments while keeping ionic strength constant. These data are now presented in the revised manuscript.

      (ii) Autophosphorylation with phosphoramidate is pH dependent because the nitrogen on the donor must be protonated to form a good leaving group (PMID 9398221). The pKa of phosphoramidate is ~8. Therefore, the fraction of phosphoramidate that is reactive (i.e., protonated) will be very different at pH 6.1 and 7.4.

      We are aware of those findings, but we are comparing the H12 mutant with the wt protein in each case. There is no reason to believe that the presence of the mutant should alter the phosphoramidate substrate, so at each pH we are comparing wt phosphorylation with that of the mutant (Fig 4b, c).
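      The pH argument in point (ii) can be made quantitative with the Henderson-Hasselbalch relation. The sketch below is an idealization, assuming a single ionizable group with the pKa of ~8 quoted in point (ii); it estimates the protonated (reactive) fraction of the phosphodonor at the two pH values used in the experiments.

```python
def fraction_protonated(ph, pka):
    """Protonated fraction of a single ionizable group (Henderson-Hasselbalch).

    [HA] / ([HA] + [A-]) = 1 / (1 + 10**(pH - pKa)).
    """
    return 1.0 / (1.0 + 10.0 ** (ph - pka))

PKA = 8.0  # approximate value quoted in point (ii); an assumption here
for ph in (6.1, 7.4):
    print(f"pH {ph}: {fraction_protonated(ph, PKA):.2f} protonated")
# -> about 0.99 at pH 6.1 and about 0.80 at pH 7.4
```

      Under these assumptions the reactive fraction of the donor differs by roughly 20 percentage points between the two pH values, but at a given pH the same factor applies to wt and H12Q, which is the comparison the response relies on.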

      (iii) Response regulator autophosphorylation absolutely depends on the presence of a divalent metal ion (usually Mg2+) in the active site (PMID 2201404). There is no guarantee that the 20 mM Mg2+ included in the reaction is sufficient to saturate SsrB. Furthermore, as the authors themselves note, the amino acid at SsrB position 12 is likely to affect the affinity of Mg2+ binding. Therefore, the fraction of SsrB that is reactive (i.e. has Mg2+ bound) may differ between wildtype and the H12Q mutant, and/or between wildtype at different pHs (because the protonation state of His12 changes).

      This is exactly the point that we are making, and it is why we varied the magnesium concentration (increasing it to 50-100 mM). There was a slight increase in phosphorylation at 50 mM MgCl2 compared to 20 mM, and only a slight increase between 50 and 100 mM at pH 6.1. The revised phosphorylation experiments all contain 100 mM MgCl2.

      2) The data in Figures 1abcd and 3de are clearly sigmoidal rather than hyperbolic, indicating cooperativity. However, there are insufficient data points between the upper and lower bounds to accurately calculate the Hill coefficient or KD values. This limitation of the data means that comparisons of apparent Hill coefficient or KD values under different conditions cannot be the basis of credible conclusions.

      We respectfully disagree. In every curve that we provide, there is at least one data point in the transition between low and high binding. With the H12Q mutant, we obtained two data points in the transition, and the KD was the same as that of wildtype (Fig. 2). We provide an analysis of the binding curve that shows the range of KD values obtained from the lower and upper error bounds of that point (132-168 nM); this range does not significantly change the value (now shown in Fig.1– figure supplement 1). For the very high affinity we observed at pH 6.1 (KD ~5 nM), the corresponding range is 4-8 nM (i.e., still VERY high affinity). These ranges of affinities at neutral and acid pH are very reminiscent of the affinities we measured for OmpR and OmpR~P at the porin promoters, suggesting that acid pH puts SsrB in an activated state even in the absence of phosphorylation. A similar argument holds for the Hill coefficient (see Figure).
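      For reference, sigmoidal binding curves of the kind discussed in point 2 are conventionally described by the Hill equation, where θ is the fraction of DNA bound, [P] the protein concentration, n the Hill coefficient, and K_{0.5} the concentration at half-maximal binding. Treating the reported apparent KD as this midpoint is an assumption of this parameterization. The second and third expressions follow directly from the first: the curve crosses one half at K_{0.5} regardless of n, and its steepness there (on a log-concentration axis) is n/4.

$$
\theta([P]) = \frac{[P]^{n}}{K_{0.5}^{n} + [P]^{n}}, \qquad
\theta(K_{0.5}) = \tfrac{1}{2}, \qquad
\left.\frac{d\theta}{d\ln[P]}\right|_{[P]=K_{0.5}} = n\,\theta(1-\theta)\Big|_{\theta=1/2} = \frac{n}{4}.
$$

      Both the midpoint and the slope n/4 are determined largely by the data in the transition region, since the plateaus carry little information about either; this is the crux of the disagreement in point 2 about how many transition points are needed.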

      3) There are hundreds of receiver domain structures in the PDB. There is some variation, but to a first approximation, all receiver domain structures exhibit a (beta/alpha)5 fold. The structure of SsrB predicted by i-TASSER breaks the standard beta-2 strand into two parts, which throws off the numbering for subsequent beta strands. Given the highly conserved receiver domain fold, I am skeptical that the predicted i-TASSER structure is correct or adds any value to the manuscript. If the authors wish to retain the structure in the manuscript, then they should point out the unusual feature and its consequence for strand numbering.

      We now include a new model based on the RcsB/DNA crystal structure that eliminates this problem (see new Fig.2– figure supplement 2). We have replaced the i-TASSER model with an AlphaFold prediction that was energy-minimized to align with the RcsB dimer crystal structure (Fig.5– figure supplement 2). This model has the canonical (beta/alpha)5 fold, so the classical strand numbering applies.

      4) The detailed predictions of active site structure in Supplementary Figure 5 are not physiologically relevant because Mg2+ was not included in the simulation. The presence of a divalent cation binding to Asp10 and Asp11 is likely to substantially alter interactions between Asp 10, Asp11, His12, and Lys109.

      See the response to 1iii above and new Fig.5– figure supplement 2. Author response image 1 is a zoomed-in snapshot of supplementary Figure 8c, modelled using the RcsB dimer bound to BeF3- and Mg2+ (PDB 6ZIX). Both the i-TASSER and AlphaFold receiver domain models align well with this structure, and the polar contacts and pi-cation interactions made by His12 are maintained.

      Author response image 1.

      5) The authors present an AlphaFold model of an SsrB dimer, and note that His12 is at the dimer interface. However, the authors also believe that a higher-order oligomer of SsrB binds to DNA in a pH-dependent manner. Do the authors have any suggestions or informed speculation about how His12 might affect higher-order oligomerization, as opposed to dimerization?

      As mentioned in response to point 3 above, we now include a new model of an SsrB dimer bound to DNA based on our NMR structure of the CTD and the RcsB/DNA structure. The RcsB paper also provides evidence for a higher-order oligomer in the crystal structure of unphosphorylated RcsB (and RcsB bound to BeF3-), which showed an asymmetric unit containing 6 molecules of RcsB that form 3 dimers arranged in a hexameric structure resembling a cylinder. This configuration involves a crossed conformation in which the REC of one molecule interacts with the DBD of another; interestingly, His12 interacts with the DBD of another molecule. We modelled an SsrB oligomer structure using the RcsB hexamer as a template and have included it as a new figure (see Fig.5– figure supplement 3) and in the revised Discussion (lines 432-448).

    1. Author Response

      Reviewer #1 (Public Review):

      1) One nagging concern is that the category structure in the CNN reflects the category structure baked into color space. Several groups (e.g. Regier, Zaslavsky, et al) have argued that color category structure emerges and evolves from the structure of the color space itself. Other groups have argued that the color category structure recovered with, say, the Munsell space may partially be attributed to variation in saturation across the space (Witzel). How can one show that these properties of the space are not the root cause of the structure recovered by the CNN, independent of the role of the CNN in object recognition?

      We agree that there is overlap with previous studies on color structure. In our revision, we show that color categories are directly linked to the CNN being trained on the object-recognition task and not to the CNN per se. We repeated our analysis on a scene-trained network (using the same input set) and found that here the color representation in the final layer deviates considerably from the one created for object classification. Given that the input set is the same, this strongly suggests that any reflection of the structure of the input space serves object recognition (see the bottom of the “Border Invariance” section; Page 7). Furthermore, the new experiments with random hue shifts applied to the input images show that in this case stable borders do not arise, whereas they would be expected to arise if the border invariance were a consequence of the chosen color space alone.

      A crucial distinction from previous results is that, by specifically replacing the final layer, our analysis looks at the representation the network has built to perform the object classification task. As such, the current finding goes beyond the notion that the color category structure is already reflected in the color space.

      2) In Figure 1, it could be useful to illustrate the central observation by showing a single example, as in Figure 1 B, C, where the trained color is not in the center of the color category. In other words, if the category structure is immune to the training set, then it should be possible to set up a very unlikely set of training stimuli (ones that are as far away from the center of the color category while still being categorized most of the time as the color category). This is related to what is in E, but is distinctive for two reasons: first, it is a post hoc test of the hypothesis recovered in the data-driven way by E; and second, it would provide an illustration of the key observation, that the category boundaries do not correspond to the median distance between training colors. Figure 5 begins to show something of this sort of a test, but it is bound up with the other control related to shape.

      We have now added a post-hoc test in which we shift the training bands from likely to unlikely positions using the original paradigm. Retraining the output layers while shifting the training bands from the left to the right category edge (in 9 steps) shows the invariance of the category bounds specifically (see Supp. Inf.: Figure S11). The most extreme cases (top and bottom rows) have the training bands right at the edge of the border; these are the interesting cases the reviewer refers to. We also added 7 steps in between to show how the borders shift with the bands.

      Similarly, if the claim is that there are six (or seven?) color categories, regardless of the number of colors used to train the data, it would be helpful to show the result of one iteration of the training that uses say 4 colors for training and another iteration of the training that uses say 9 colors for training.

      We have now included the figure presented in 1E for all the color iterations used (see SI: Figure S10). We are also happy to include a single iteration, but believe this gives the most complete view of what the reviewer is asking for.

      The text asserts that Figure 2 reflects training on a range of color categories (from 4 to 9) but doesn’t break them out. This is an issue because the average across these iterations could simply be heavily biased by training on one specific number of categories (e.g. the number used in Figure 1). These considerations also prompt the query: how did you pick 4 and 9 as the limits for the tests? Why not 2 and 20? (the largest range of basic color categories that could plausibly be recovered in the set of all languages)?

      The number of output nodes was inspired by the number of basic color categories that English speakers observe in the hue spectrum (in which a number of the basic categories are not represented). We understand that this is not a strong reason; however, the lack of studies on color categories in CNNs forced us to approach this in an exploratory manner. We have adapted the text to better reflect this shortcoming (bottom of page 4). Naturally, had the data indicated that these numbers were not a good fit, we would have adapted the range (if there were more categories, we would have expected more noise and would have increased the number of training bands to test this). As indicated above, we have now also included the classification plots for all the different counts, so the reader can review this as well (SI: Section 9).

      3) Regarding the transition points in Figure 2A, indicated by red dots: how strong (transition count) and reliable (consistent across iterations) are these points? The one between red and orange seems especially willfully placed.

      To answer the question on consistency, we have now included a replication of the ResNet18 analysis, along with ResNet34, ResNet50 and ResNet101, in the SI (section 1). We have also added a new section to the SI presenting the results of alternative CNNs (section S8). Despite small idiosyncrasies, the general pattern of results recurs.

      Concerning the red-orange border, it was not willfully placed, but we very much understand that in isolation it looks like it could simply be the result of noise. Nevertheless, the recurrence of this border in several analyses made us confident that it does reflect a meaningful invariance. Notably:

      • We find a more robust peak between red and orange in the luminance control (SI section 3).

      • The evolutionary algorithm with 7 borders also places a border in this position.

      • We find the peak recurs in the ResNet-18 replication as well as in several of the deeper ResNets and several of the other CNNs (SI section 1).

      • We also find that the peak is present throughout the different layers of the ResNet-18.

      4) Figure 2E and Figure 5B are useful tests of the extent to which the categorical structure recovered by the CNNs shifts with the colors used to train the classifier, and it certainly looks like there is some invariance in category boundaries with respect to the specific colors used to train the classifier, an important and interesting result. But these analyses do not actually address the claim implied by the analyses: that the performance of the CNN matches human performance. The color categories recovered with the CNN are not perfectly invariant, as the authors point out. The analyses presented in the paper (e.g. Figure 2E) test whether there is as much shift in the boundaries as there is stasis, but that’s not quite the test if the goal is to link the categorical behavior of the CNN with human behavior. To evaluate the results, it would be helpful to know what would be expected based on human performance.

      We understand that the lack of human data was a considerable shortcoming of the previous version of the manuscript. We have now collected human data in a match-to-sample task modeled on our CNN experiment. As with the CNN, we find that the degree of border invariance fluctuates considerably. While categorical borders are not exact matches, we broadly find the same category prototypes and also see that categories in the red-to-yellow range are quite narrow in both humans and CNNs. Please see the new “Human Psychophysics” section (page 8) in the manuscript for more details.

      5) The paper takes up a test of color categorization invariant to luminance. There are arguments in the literature that hue and luminance cannot be decoupled-that luminance is essential to how color is encoded and to color categorization. Some discussion of this might help the reader who has followed this literature.

      We have added some discussion of the interaction between luminance and color categories (e.g., Lindsay & Brown, 2009) at the bottom of page 6/ top of page 7. The current analysis mainly aimed at excluding that the borders are solely based on luminance.

      Related, the argument that “neighboring colors in HSV will be neighboring colors in the RGB space” is not persuasive. Surely this is true of any color space?

      We removed the argument about “neighboring colors”. Our procedure requires the use of a hue spectrum that wraps around the color space while including many of the highly saturated colors that are typical prototypes for human color categories. We have elected to use the hue spectrum from the HSV color space at full saturation and brightness, which is represented by the edges of the RGB color cube. As this is the space in which our network was trained, it does not introduce any deformations into the color space. Other potential choices of color space either include strong non-linear transformations that stretch and compress certain parts of the RGB cube, or exclude a large portion of the RGB gamut (yellow in particular).
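      The geometric claim that the full-saturation, full-brightness HSV hue circle runs along the edges of the RGB cube can be checked with Python's standard colorsys module. This sketch only illustrates that property (the sampling of 360 hues is an arbitrary choice) and is not the stimulus-generation code used in the study: at S = V = 1, every hue maps to an RGB triplet with one channel at 1, one at 0, and the third in between, i.e., a point on one of the six cube edges linking red, yellow, green, cyan, blue and magenta.

```python
import colorsys

def hue_spectrum(n_steps):
    """RGB triplets for n_steps hues evenly spaced around the HSV circle at S = V = 1."""
    return [colorsys.hsv_to_rgb(i / n_steps, 1.0, 1.0) for i in range(n_steps)]

def on_cube_edge(rgb, tol=1e-9):
    """True if one channel is (close to) 1 and another is (close to) 0."""
    return abs(max(rgb) - 1.0) < tol and abs(min(rgb)) < tol

colors = hue_spectrum(360)
print(all(on_cube_edge(c) for c in colors))  # True
```

      Because only linear interpolation between cube corners is involved, sampling hue this way does not stretch or compress parts of the RGB gamut, which is the property the response appeals to.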

      We have adapted the text to better reflect our reasoning (page 6, top of paragraph 2).

      6) The paper would benefit from an analysis and discussion of the images used to originally train the CNN. Presumably, there are a large number of images that depict manmade artificially coloured objects. To what extent do the present results reflect statistical patterns in the way the images were created, and/or the colors of the things depicted? How do results on color categorization that derive from images (e.g. trained with neural networks, as in Rosenthal et al and presently) differ (or not) from results that derive from natural scenes (as in Yendrikhovskij?).

      We initially hoped to analyze differences between colors in objects and backgrounds, as in Rosenthal et al.; unfortunately, in ImageNet we did not find clear differences between pixels inside the object bounding boxes provided with ImageNet and pixels outside these boxes (most likely because the rectangular bounding boxes still contain many background pixels). However, the results from the K-means analysis presented in Figure S6 (Suppl. Inf.), the color categorization throughout the layers of the object-trained network (end of the first experiment on page 7), and the color categorization in humans (Human Psychophysics, starting on page 8) all show very similar border positions.

      7) It could be quite instructive to analyze what's going on in the errors in the output of the classifiers, as e.g. in Figure 1E. There are some interesting effects at the crossover points, where the two green categories seem to split and swap, the cyan band (hue % 20) emerges between orange and green, and the pink/purple boundary seems to have a large number of green/blue results. What is happening here?

      One issue with training the network on the color task is that we can never fully guarantee that the network is using color to resolve the task, and we suspected that in some cases the network may rely on other factors as well, such as luminance. When we look at the same type of plots for the luminance-controlled task presented in the supplemental materials (see below left), we do not see these transgressions. Also, when we look at versions of the original training that use more bands, luminance is less reliable, and we also do not see these transgressions (see right plot below).

      8) The second experiment using an evolutionary algorithm to test the location of the color boundaries is potentially valuable, but it is weakened because it pre-determines the number of categories. It would be more powerful if the experiment could recover both the number and location of the categories based on the "categorization principle" (colors within a category are harder to tell apart than colors across a color category boundary). This should be possible by a sensible sampling of the parameter space, even in a very large parameter space.

      The main point of the genetic algorithm was to see whether the border locations would be corroborated by an algorithm using the principle of categorical perception. Unfortunately, an exact approach to determining the number of borders is difficult, because some border invariances are clearly stronger than others. Running the algorithm with the number of borders as a free parameter simply leads to a minimal number of borders, as 100% correct is always obtained when only one category is left. In general, since the network can combine categories into a class at no cost (in fact, having fewer borders reduces noise), fewer classes are expected to yield better performance. As such, to estimate the optimal category count, we would need to introduce some subjective trade-off between accuracy and class count.

      9) Finally, the paper sets itself up as taking "a different approach by evaluating whether color categorization could be a side effect of learning object recognition", as distinct from the approach of studying "communicative concepts". But these approaches are intimately related. The central observation in Gibson et al. is not the discovery of warm-vs-cool categories (these, as the most basic color categories, have been known for centuries), but rather the relationship of these categories to the color statistics of objects: those parts of the scene that we care about enough to label. This idea, that color categories reflect the uses to which we put our color-vision system, is extended in Rosenthal et al., where the structure of color space itself is understood in terms of categorizing objects versus backgrounds (u') and the most basic object categorization distinction, animate versus inanimate (v'). The introduction argues, rightly in our view, that "A link between color categories and objects would be able to bridge the discrepancy between models that rely on communicative concepts to incorporate the varying usefulness of color, on the one hand, and the experimental findings laid out in this paragraph on the other". This is precisely the link forged by the observation that the warm-cool category distinction in color naming correlates with object-color statistics (Gibson, 2017; see also Rosenthal et al., 2018). The argument in Gibson and Rosenthal is that color categorization structure emerges because of the color statistics of the world, specifically the color statistics of the parts of the world that we label as objects, which is the same approach adopted by the present work. The use of CNNs is a clever and powerful test of the success of this approach.

      We are sorry we did not properly highlight the enormous importance of these two earlier papers in our previous version of the manuscript. We have now elaborated our description of Gibson’s work to better reflect the important relation between the usefulness of colors and color categories (Page 2, middle and Page 19 par. above methods). We think our work nicely extends the earlier work by showing that their approach works even at a more general level with more color categories,

    1. Author Response

      Reviewer #1 (Public Review):

      In this manuscript, Abdellatef et al. describe the reconstitution of axonemal bending using polymerized microtubules (MTs), purified outer-arm dyneins, and synthesized DNA origami. Specifically, the authors purified axonemal dyneins from Chlamydomonas flagella and combined the purified motors with MTs polymerized from purified brain tubulin. Using electron microscopy, the authors demonstrate that patches of dynein motors of the same orientation at both MT ends (i.e., with their tails bound to the same MT) result in pairs of MTs of parallel alignment, while groups of dynein motors of opposite orientation at both MT ends (i.e., with the tails of the dynein motors of both groups bound to different MTs) result in pairs of MTs with anti-parallel alignment. The authors then show that the dynein motors can slide MTs apart following photolysis of caged ATP, and using optical tweezers, demonstrate active force generation of up to ~30 pN. Finally, the authors show that pairs of anti-parallel MTs exhibit bidirectional motion on the scale of ~50-100 nm when both MTs are cross-linked using DNA origami. The findings should be of interest to the cytoskeletal cell biology and biophysics communities.

      We thank the reviewer for these comments.

      We might be misunderstanding this reviewer’s comment, but the complexes with both parallel and anti-parallel MTs had dynein molecules with their tails bound to two different MTs in most cases, as illustrated in Fig.2 – suppl.1. The two groups of dyneins produce opposing forces in a complex with parallel MTs, and the majority of our complexes had a parallel arrangement of the MTs. To clarify the point, we have modified the Abstract:

      “Electron microscopy (EM) showed pairs of parallel MTs crossbridged by patches of regularly arranged dynein molecules bound in two different orientations depending on which of the MTs their tails bind to. The oppositely oriented dyneins are expected to produce opposing forces when the pair of MTs have the same polarity.”

      Reviewer #2 (Public Review):

      Motile cilia generate rhythmic beating or rotational motion to drive cells or produce extracellular fluid flow. Cilia are made of nine microtubule doublets forming a spoke-like structure, and it is known that dynein motor proteins, which connect adjacent microtubule doublets, are the driving force of ciliary motion. However, the molecular mechanism that generates motion is still unclear. The authors proved that a pair of microtubules stably linked by DNA origami and driven by outer dynein arms (ODAs) produces a beating motion. They employed an in vitro motility assay and negative-stain TEM to characterize this complex. They demonstrated that stable linking of the microtubules and ODAs anchored on both microtubules are essential for oscillatory motion and bending of the microtubules.

      Strength

      This is an interesting work, addressing an important question in the motile cilia community: what is the minimum system needed to generate a beating motion? It is an established fact that the dynein power stroke on the microtubule doublet is the driving force of the beating motion. It was also known that the radial spokes and the central pair are essential for ciliary motion under physiological conditions, but cilia without radial spokes and the central pair can beat under some special conditions (Yagi and Kamiya, 2000). Therefore, from a mechanistic point of view, they are not prerequisites. It is generally thought that a fixed connection between adjacent microtubules by nexin converts the sliding motion of dyneins into bending, but this had never been investigated experimentally. Here the authors successfully established a simple system of nexin-like inter-microtubule linkage using the DNA origami technique to generate oscillatory and beating motions. This yields an interesting system in which ODAs form groups anchored on two microtubules and oriented oppositely, thereby causing tug-of-war-type force generation. The authors demonstrated that this system, under constraints imposed by DNA origami, generates oscillatory and beating motions.

      The authors carefully coordinated the experiments to demonstrate oscillations using optical tweezers and sophisticated data analysis (Fourier analysis and a step-finding algorithm). They also proved, using negative stain EM, that this system contains two groups of ODAs forming arrays with opposite polarity on the parallel microtubules. The manuscript is carefully organized with impressive movies. Geometrical and motility analyses of individual ODAs used for statistics are provided in the supplementary source files. They appropriately cited similar past works from Kamiya and Shingyoji groups (they employed systems closer to the physiological axoneme to reproduce beating) and clarify the differences from this study.

      We thank the reviewer for these comments.
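      The reviewer mentions Fourier analysis of the optical-tweezers traces, but the details are not given here. As a rough illustration only, the dominant oscillation frequency of a displacement trace could be estimated from its amplitude spectrum with NumPy's FFT. The function name, sampling rate, amplitude and synthetic trace below are hypothetical, and this is not the authors' analysis pipeline.

```python
import numpy as np


def dominant_frequency(trace: np.ndarray, sample_rate_hz: float) -> float:
    """Frequency (Hz) of the largest non-zero-frequency peak in the amplitude spectrum."""
    detrended = trace - trace.mean()            # remove the constant offset
    spectrum = np.abs(np.fft.rfft(detrended))   # one-sided amplitude spectrum
    freqs = np.fft.rfftfreq(trace.size, d=1.0 / sample_rate_hz)
    peak = int(np.argmax(spectrum[1:])) + 1     # skip the 0 Hz bin
    return float(freqs[peak])


# Hypothetical trace: 10 s sampled at 100 Hz, a 2 Hz, 30 nm oscillation plus noise.
rng = np.random.default_rng(0)
fs = 100.0
t = np.arange(1000) / fs
trace_nm = 30.0 * np.sin(2 * np.pi * 2.0 * t) + rng.normal(0.0, 5.0, t.size)
print(dominant_frequency(trace_nm, fs))
```

      The frequency resolution of this approach is the sampling rate divided by the number of samples (0.1 Hz here). Short or irregular oscillations would therefore give broad, less reliable peaks.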

      Weakness

      The authors claim this system mimics two pairs of doublets on opposite sides of the 9+2 ciliary structure by having two groups of ODAs between two microtubules facing opposite directions within the pair. This is not exactly the case. In the real axoneme, ODAs form a continuous array along the entire length of the doublets, which means that at any point there are ODAs facing opposite directions. In their system, opposite ODAs cannot exist at the same point (therefore the scheme of the dynein-MT complex in Fig.1B is slightly misleading).

      Actually, opposite ODAs can exist at the same point in our system as well, and previous work using a much higher concentration of dyneins (e.g., Oda et al., J. Cell Biol., 2007) showed two continuous arrays of dynein molecules between a pair of microtubules. To observe the structures of individual dynein molecules, we used low concentrations of dynein and searched for areas where dynein could be observed without superposition, but there were some areas where opposite dyneins existed at the same point.

      We realize that we did not clearly explain this issue, so we have revised the text accordingly.

      In the 1st paragraph of Results: “In the dynein-MT complexes prepared with high concentrations of dynein, a pair of MTs in bundles are crossbridged by two continuous arrays of dynein, so that superposition of two rows of dynein molecules is observed in EM images (Haimo et al., 1979; Oda et al., 2007). On the other hand, when a low concentration of the dynein preparation (6.25–12.5 µg/ml (corresponding to ~3-6 nM outer-arm dynein)) was mixed with 20-25 µg/ml MTs (200-250 nM tubulin dimers), the MTs were only partially decorated with dynein, so that we were able to observe single layers of crossbridges without superposition in many regions.” Legend of Fig. 1(C): “Note that the geometry of dyneins in the dynein-MT complex shown in (B) mimics that of a combination of the dyneins on two opposite sides of the axoneme (cyan boxes), although the dynein arrays in (B) are not continuous.”
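      The molar concentrations quoted in the revised text can be cross-checked with a unit conversion. This sketch assumes molecular masses of roughly 100 kDa for a tubulin dimer and roughly 2 MDa for Chlamydomonas outer-arm dynein. These are approximate values that are not stated in the response, and the function name is hypothetical.

```python
def mass_to_nanomolar(conc_ug_per_ml: float, mol_weight_da: float) -> float:
    """Convert a mass concentration in µg/ml to a molar concentration in nM."""
    grams_per_litre = conc_ug_per_ml * 1e-3          # 1 µg/ml = 1 mg/l = 1e-3 g/l
    mol_per_litre = grams_per_litre / mol_weight_da  # 1 Da = 1 g/mol
    return mol_per_litre * 1e9                       # mol/l -> nmol/l


# Assumed approximate molecular masses (not stated in the response):
TUBULIN_DIMER_DA = 100e3     # ~100 kDa per alpha/beta-tubulin dimer
OUTER_ARM_DYNEIN_DA = 2e6    # ~2 MDa per outer-arm dynein complex

for ug in (20.0, 25.0):
    print(f"{ug} µg/ml tubulin -> {mass_to_nanomolar(ug, TUBULIN_DIMER_DA):.0f} nM dimers")
for ug in (6.25, 12.5):
    print(f"{ug} µg/ml dynein -> {mass_to_nanomolar(ug, OUTER_ARM_DYNEIN_DA):.1f} nM")
```

      With these assumed masses, 20-25 µg/ml tubulin corresponds to 200-250 nM dimers, and 6.25-12.5 µg/ml dynein corresponds to roughly 3-6 nM. Both results are consistent with the figures given above.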

      If they want to project their result to the ciliary beating model, more insight/explanation would be necessary. For example, arrays of dyneins at certain positions within the long array along one doublet are activated and generate force, while dyneins at different positions are activated on another doublet at the opposite site of the axoneme. This makes the distribution of dyneins and their orientations similar to the system described in this work. Such a localized activation, shown in physiological cilia by Ishikawa and Nicastro groups, may require other regulatory proteins.

      We agree that the distributions of activated dyneins in 3D are extremely important in understanding ciliary beating, and that other regulatory proteins would be required to coordinate activation in different places in an axoneme. However, the main goal of this manuscript is to show the minimal components for oscillatory movements, and we feel that discussing the distributions of activated dyneins along the length of the MTs would be too complicated and beyond the scope of this study.

      They attempted to reveal a conformational change of ODAs induced by the power stroke using negative-stain EM images, which is less convincing than the past cryo-ET work (Ishikawa, Nicastro, Pigino groups) and negative-stain EM of sea urchin outer dyneins (Hirose group), where the tail and head parts were clearly defined in the 3D maps or 2D averages of two-dynein ODAs. Probably the three heavy chains and associated proteins hinder detailed visualization of the tail structure. Because of this, Fig.2C is not clear enough to prove a conformational change of the ODA. This reviewer suspects that refined subaveraging (probably with larger datasets) is necessary.

      As the reviewer suggests, one of the reasons for less clear averaged images compared to the past images of sea urchin ODA is the three-headed structure of Chlamydomonas ODA. Another and perhaps the bigger reason is the difficulty of obtaining clear images of dynein molecules bound between 2 MTs by negative stain EM: the stain accumulates between MTs that are ~25 nm in diameter and obscures the features of smaller structures. We used cryo-EM with uranyl acetate staining instead of negative staining for the images of sea urchin ODA-MT complexes we previously published (Ueno et al., 2008) in order to visualize dynein stalks. We agree with the reviewer that future work with larger datasets and by cryo-ET is necessary for revealing structural differences.

      That having been said, we did not mean to prove structural changes, but rather intended to show that our observation suggests structural changes and thus this system is useful for analyzing structural changes in future. In the revised manuscript, we have extensively modified the parts of the paper discussing structural changes (Please see our response to the next comment).

      It is not clear, from the inset of Fig.2 supplement3, how the end of the tail is defined for the length measurement, which is the basis for the authors' claim of a conformational change (lines 263-265). The appearance of the tail would be altered when seen from even slightly different viewing angles. Comparison with 2D projections of apo- and nucleotide-bound 3-headed ODA structures from the EM Data Bank would help.

      We agree with the reviewer that differences in viewing angle affect the apparent length of a dynein molecule, although the two MTs crossbridged by dyneins lie on the carbon membrane, so the variation in viewing angle is expected to be relatively small. To examine how much the apparent length is affected by the viewing angle, we calculated 2D-projected images of the cryo-ET structures of the Chlamydomonas axoneme (emd_1696 and emd_1697; Movassagh et al., 2010) at different viewing angles, and measured the apparent length of the dynein molecule using the same method we used for our negative-stain images (Author response image 1). As shown in the plot, within the 40-degree range measured here, the effect of viewing angle on the apparent length is smaller than the difference between the two nucleotide states. Thus, we think that the length difference shown in Fig.2-suppl.4 reflects a real structural difference between the no-ATP and ATP states. In addition, it is reasonable to assume that the distributions of viewing angles in the negative-stain images are similar in the absence and presence of ATP, which further supports this conclusion.

      Nevertheless, since we agree with the reviewer that we cannot measure the precise length of the molecule using these 2D images, we have revised the corresponding parts of the manuscript, adding description about the effect of view angles on the measured length in the manuscript.

      Author response image 1. Effects of viewing angles on apparent length. (A) and (B) 2D-projected images of cryo-electron tomograms of Chlamydomonas outer arm dynein in an axoneme (Movassagh et al., 2010) viewed from different angles. (C) apparent length of the dynein molecule measured in 2D-projected images.
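      The following gives some intuition for why modest tilts matter relatively little. It uses a simplified rigid-rod model, which is our assumption: a dynein molecule is not a rod, and the authors instead used 2D projections of the cryo-ET maps. A rod of true length $L$ tilted out of the image plane by an angle $\theta$ projects to

```latex
L_{\mathrm{app}} = L\cos\theta \approx L\left(1 - \frac{\theta^{2}}{2}\right),
\qquad
\frac{L - L_{\mathrm{app}}}{L} = 1 - \cos\theta \approx
\begin{cases}
0.015, & \theta = 10^{\circ},\\
0.060, & \theta = 20^{\circ}.
\end{cases}
```

      Because the shortening is second order in $\theta$, tilts of a few degrees barely change the apparent length. Only tilts of tens of degrees shorten it by more than a few percent.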

      In this manuscript, we discuss two structural changes: 1) a difference in the dynein length between no-nucleotide and +ATP states (Fig.2-suppl.4), and 2) possible structural differences in the arrangement of the dynein heads (Fig.2-suppl.3). Although we realize that extensive analysis using cryo-ET is necessary to reveal the second structural change, we attempted to compare the structures of oppositely oriented dyneins, hoping that this would lead to future research. In the revised manuscript, we have added 2D projection images of emd_1696 and emd_1697 in Fig.2-suppl.3, so that readers can compare them with our negative-stain images. Our impression was that some of our 2D images in the presence of ATP resembled the cryo-ET structure with ADP.Vi, whereas others appeared closer to the no-nucleotide cryo-ET structure. We also attempted to calculate cross-correlations, but difficulties in removing the contribution of MTs, which sometimes overlapped with part of a dynein molecule, and in matching the magnification and contrast of different images prevented us from obtaining reliable results.

      To address this and the previous comments, we have extensively modified the section titled ‘Structures of dynein in the dynein-MT-DNA-origami complex’.

      In Fig.5B (where the oscillation occurs), the microtubule was once driven >150nm unidirectionally and went back to the original position, before oscillation starts. Is it always the case that relatively long unidirectional motion and return precede oscillation? In Fig.7B, where the authors claim no oscillation happened, only one unidirectional motion was shown. Did oscillation not happen after MT returned to the original position?

      Long unidirectional movement of ~150 nm was sometimes observed, but not necessarily before the start of oscillation. For example, in Figure 5 – figure supplement 1A, oscillation started soon after the UV flash, and then unidirectional movement occurred.

      With the dynein-MT complex in which dyneins are unidirectionally aligned (Fig.7B, Fig.7-suppl.2), the MTs kept moving and escaped from the trap or just stopped moving probably due to depletion of ATP, so we did not see a MT returning to the original position.

      Lines 284-290: More characterization of the bending motion will be necessary (and should be possible). What is its frequency? Did the authors confirm that other systems (either without DNA origami or without oppositely arrayed ODAs) cannot generate repetitive beating?

      The frequencies of the bending motions measured from the movies in Fig.8 and Fig.8-suppl.1 were 0.6 – 1 Hz, and the motions were rather irregular. Even if there were complexes bending at high frequencies, it would not have been possible to detect them due to the low time resolution of these fluorescence microscopy experiments (~0.1 s). Future studies at a higher time resolution will be necessary for further characterization of bending motions.
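      The time-resolution limit mentioned above follows from the sampling theorem. With a frame interval $\Delta t \approx 0.1$ s, periodic motion can only be resolved up to the Nyquist frequency

```latex
f_{\mathrm{N}} = \frac{1}{2\,\Delta t} = \frac{1}{2 \times 0.1\ \mathrm{s}} = 5\ \mathrm{Hz}.
```

      The observed 0.6–1 Hz bending therefore lies within range. Bending faster than about 5 Hz, however, would either be missed or be aliased to a lower apparent frequency. Motion blur during each exposure may lower the usable limit further.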

      To observe bending motions, the dynein-MT complex should be fixed to the glass or a bead at one part of the complex while the other end is free in solution. With the dynein-MT-DNA-origami complexes, we looked for such complexes and found some showing bending motions as in Fig. 8. To answer the reviewer’s question asking if we saw repetitive bending in other systems, we checked the movies of the complexes without DNA-origami or without ODAs arraying oppositely but did not notice any repetitive bending motions. However, future studies using the system with a higher temporal resolution and perhaps with an improved method for attaching the complex would be necessary in these cases as well.

    1. Author Response

      Reviewer #1 (Public Review):

      Overall, this study is well designed with convincing experimental data. The following critiques should be considered:

      1) It is important to examine whether the phenotype of METTL18 KO is mediated through changes in RPL3 methylation. The functional link between METTL18 and RPL3 methylation in regulating translation elongation needs to be examined in detail.

      We truly thank the reviewer for the suggestion. Accordingly, we set up experiments combined with hybrid in vitro translation (Panthu et al. Biochem J 2015 and Erales et al. PNAS 2017) and the Renilla–firefly luciferase fusion reporter system (Kisly et al. NAR 2021) (see Figure 5A).

      To test the impact of RPL3 methylation on translation directly, we purified ribosomes from METTL18 KO or naïve HEK293T cells, added them to ribosome-depleted rabbit reticulocyte lysate (RRL), and then conducted an in vitro translation assay (i.e., hybrid translation, Panthu et al. Biochem J 2015 and Erales et al. PNAS 2017) (see figure above and Figure 5A). Indeed, we observed that removal of the ribosomes from RRL decreased protein synthesis in vitro and that the addition of ribosomes from HEK293T cells efficiently recovered the activity (see Figure 5 — figure supplement 1A).

      To test the effect on Tyr codon elongation, we harnessed a fusion of Renilla and firefly luciferases. This system allows us to detect delayed or accelerated synthesis of the downstream firefly luciferase (Fluc) relative to the upstream Renilla luciferase, and thus to focus on elongation through the sequence inserted between the two luciferases (Kisly et al. NAR 2021) (see figure above and Figure 5A). For better detection of the effects on Tyr codons, we used a 39× repeat of the codon (the number was dictated by cloning constraints in our hands). We note that insertion of the Tyr codon repeat reduced the elongation rate (or processivity), as we observed a reduced slope of downstream Fluc synthesis (see Figure 5 — figure supplement 1B).

      Using this setup, we observed that, compared to ribosomes from naïve cells, RPL3 methylation-deficient ribosomes led to faster elongation at Tyr repeats (see Figure 5B). These data, which directly reflect the activity of ribosomes possessing unmethylated RPL3, provide solid evidence of a link between RPL3 methylation and translation elongation at Tyr codons.
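      The response infers elongation through the Tyr repeat from the slope of downstream Fluc synthesis. The following is a minimal sketch of such a slope comparison. It assumes that luminescence time courses are available and that the linear phase lies in a user-chosen window. The window bounds, the normalisation of the Fluc slope to the Renilla (Rluc) slope, and the function names are hypothetical illustrations, not the protocol of Kisly et al. or of the authors.

```python
import numpy as np


def linear_phase_slope(time_min, signal, t_start, t_end) -> float:
    """Least-squares slope of signal vs. time inside the window [t_start, t_end]."""
    t = np.asarray(time_min, dtype=float)
    y = np.asarray(signal, dtype=float)
    in_window = (t >= t_start) & (t <= t_end)
    slope, _intercept = np.polyfit(t[in_window], y[in_window], 1)
    return float(slope)


def relative_fluc_rate(time_min, fluc, rluc, t_start, t_end) -> float:
    """Fluc accumulation slope normalised to the Rluc slope (hypothetical normalisation)."""
    return (linear_phase_slope(time_min, fluc, t_start, t_end)
            / linear_phase_slope(time_min, rluc, t_start, t_end))


# Hypothetical time courses (arbitrary luminescence units, one point per minute).
t = np.arange(30)
rluc = 4.0 * t                               # upstream reporter rises from the start
fluc = np.where(t > 5, 2.0 * (t - 5), 0.0)   # downstream reporter appears after a lag
print(relative_fluc_rate(t, fluc, rluc, t_start=10, t_end=25))
```

      Comparing this ratio between reactions supplemented with methylation-deficient and naïve ribosomes would be one way to express a faster or slower passage through the repeat. The choice of window matters, because it must exclude the initial lag.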

      2) The obvious discrepancy between the recent NAR paper and this study lies in the ribosome profiling results (such as Fig.S5). Cell-line-specific regulation in HAP1 cells (used in the NAR paper) vs. 293T cells (used in this study) needs to be explored. For example, would METTL18 KO in HAP1 cells cause a difference in polysome profiling in this study? Some of the negative findings in this study (such as Fig.S3B, Fig.S5A) would need some kind of positive control to make sure that the assay conditions are working.

      According to the reviewer’s suggestion, we conducted polysome profiling of the HAP1 cells with METTL18 knockout. For this assay, we used the same cell line (HAP1 METTL18 KO, 2-nt del.) as in the earlier NAR paper. As shown in Figure 9 — figure supplement 2A and 2B, we observed reduced polysomes in this cell line, as observed in the NAR paper.

      Assessing the rRNAs and the complex mass in the sucrose gradient (see Figure 9 — figure supplement 2C-E), we did not find a change in the abundance of 40S and 60S subunits upon METTL18 KO in HAP1 cells. This observation was again consistent with earlier reports.

      Overall, our sucrose density gradient experiments (polysome and 40S/60S ratio) were congruent with the NAR paper. The difference lies in HEK293T cells, where METTL18 deletion had only a limited effect on polysome formation (Figure 4 — figure supplement 1A and 1B). To provide a careful control for this observation, we induced a delay in 60S biogenesis, as requested by the Reviewer. Here, we treated cells with siRNA targeting RPL17, which is needed for proper 60S assembly (Wang et al. RNA 2015). Quantification of the sucrose density gradients (SDG) showed a reduction of 60S (see figure below and Figure 3 — figure supplement 1D-F) and polysomes (see Figure 4 — figure supplement 1C and 1D), highlighting the weaker effects of METTL18 depletion on 60S and polysome formation in HEK293T cells. We note that all the sucrose density gradient experiments were repeated 3 times, quantified, and statistically tested.

      To further assess the difference between our data and those in the earlier NAR paper, we also performed ribosome profiling on 3 independent KO lines in HAP1 cells, including the one used in the NAR paper (METTL18 KO, 2-nt del.). Indeed, all METTL18 KO HAP1 cells showed a reduction in footprints on Tyr codons, as observed in HEK293 cells (see Figure 4H). Thus, RPL3 methylation had a consistent effect on elongation irrespective of the cell type. In contrast, we could not find such a trend when we reanalyzed the published data (Małecki et al. NAR 2021) (see figure below).
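      The following illustrates what "footprints on Tyr codons" means in practice. It is a simplified sketch that assigns each footprint to the codon in its A site. It assumes a single fixed A-site offset of 15 nt from the read's 5' end, which is a common approximation; real pipelines calibrate offsets per read length and normalise occupancy to codon usage. All function names and the toy sequence are hypothetical.

```python
from typing import Optional

TYR_CODONS = {"TAT", "TAC"}


def a_site_codon(cds: str, read_5p: int, a_site_offset: int = 15) -> Optional[str]:
    """A-site codon for a footprint whose 5' end maps to 0-based CDS position read_5p.

    Returns None when the offset lands out of frame or past the end of the CDS.
    """
    a_pos = read_5p + a_site_offset
    if a_pos % 3 != 0 or a_pos + 3 > len(cds):
        return None
    return cds[a_pos:a_pos + 3]


def tyr_a_site_fraction(cds: str, read_5p_positions, a_site_offset: int = 15) -> float:
    """Fraction of in-frame footprints whose A-site codon is Tyr (TAT or TAC)."""
    codons = [a_site_codon(cds, p, a_site_offset) for p in read_5p_positions]
    assigned = [c for c in codons if c is not None]
    if not assigned:
        return 0.0
    return sum(c in TYR_CODONS for c in assigned) / len(assigned)


# Toy CDS with a single Tyr codon (TAT) at positions 15-17.
cds = "ATG" + "GCT" * 4 + "TAT" + "GCT" * 5
print(a_site_codon(cds, 0), a_site_codon(cds, 3), a_site_codon(cds, 1))
print(tyr_a_site_fraction(cds, [0, 3, 1]))
```

      In a KO-versus-control comparison, a lower Tyr fraction (after normalisation to codon frequency) would correspond to the "reduction in footprints on Tyr codons" described above.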

      So far, we have not been able to identify the origin of the difference in ribosome profiling compared to the earlier paper. Culture conditions or other conditions may affect the data. Given this, we amended the discussion to cover potential context- or situation-dependent effects of RPL3 methylation.

      3) For loss-of-function studies of METTL18, it would be beneficial to have a second sgRNA to knock out METTL18 to solidify the conclusion.

      We thank the reviewer for the constructive suggestion. Instead of screening additional METTL18 KO in HEK293T cells, we conducted additional ribosome profiling experiments in HAP1 cells with 3 independent KO lines. In addition to ensuring reproducibility, these experiments should assess whether our results are specific to the HEK293T cells that we mainly used. As mentioned above, even in the different cell lines, we observed faster elongation of the Tyr codon by METTL18 deficiency.

      4) In addition to loss-of-function studies of METTL18, gain-of-function studies of METTL18 would help make this study more convincing.

      Again, we thank the reviewer for the constructive suggestion. To address this issue, we conducted RiboTag-IP and subsequent ribosome profiling. Here, we expressed C-terminally FLAG-tagged RPL3, either WT or the His245Ala mutant that METTL18 cannot methylate (Figure 2A), in HEK293T cells, treated the lysate with RNase, immunoprecipitated the FLAG-tagged ribosomes, and then prepared a ribosome profiling library (see figure below, left). This experiment assessed translation driven specifically by the tagged ribosomes. We observed that the His245Ala mutation had a weaker impact on Tyr codon elongation than METTL18 KO had relative to naïve cells (see figure below, right). Given that METTL18 KO leaves an unmodified His at this position, the enhanced Tyr elongation may be mediated by the bare His but not by Ala. Since this point may be beyond the scope of this study, we omitted it from the manuscript. However, we are happy to add the data to the supplementary figures if requested.

      Reviewer #3 (Public Review):

      In this article, Matsuura-Suzuki et al. provided strong evidence that the mammalian protein METTL18 methylates a histidine residue in the ribosomal protein RPL3 using a combination of click chemistry, quantitative mass spectrometry, and in vitro methylation assays. They showed that METTL18 was associated with early sucrose gradient fractions prior to the 40S peak on a polysome profile and interpreted that as evidence that RPL3 is modified early in the 60S subunit biogenesis pathway. They performed cryo-EM of ribosomes from a METTL18-knockout strain and showed that the methyl group on the histidine present in published cryo-EM data was missing in their new cryo-EM structure. The missing methyl group produced minor changes in the residue conformation, in keeping with the minor effects observed on translation. They performed ribosome profiling to determine what is being translated efficiently in cells with and without METTL18, and found decreased enrichment of tyrosine codons in the A site of ribosomes from cells lacking METTL18. They further showed that longer ribosome footprints, corresponding to ribosomes that have already bound an A-site tRNA, contained fewer tyrosine codons in the A site when METTL18 was lacking. This suggests that methylation normally slows down elongation after tRNA loading but prior to EF-2 dissociation. They hypothesize that this decreased rate affects protein folding and follow up with fluorescence microscopy to show that EGFP aggregated more readily in cells lacking METTL18, suggesting that the slowdown of translation elongation mediated by METTL18 leads to enhanced folding. Finally, they performed SILAC on aggregated proteins to confirm that more tyrosine was incorporated into protein aggregates from cells lacking METTL18.

      The article is interesting and uses a large number of different techniques to present evidence that histidine methylation of RPL3 leads to decreased elongation rates at tyrosine codons, allowing time for effective protein folding.

      We thank the reviewer for the positive comments.

      I agree with the interpretation of the results, although I do have minor concerns:

      1) The magnitude of each effect observed by ribosome profiling is very small, which is not unusual for ribosome modifications or methylation. Methylation seems to occur on all ribosomes in the cell since the modification is present in several cryo-EM structures. The authors suggest that the modification occurs during biogenesis prior to folding and being inaccessible to METTL18, so it is unlikely to be removed. For that reason, I do not think it is warranted to claim that this is an example of a ribosome code, or translation tuning. Those terms would indicate regulated modifications that come on and off of proteins, but the authors have not presented evidence that the activity is regulated (and don't really need to for this paper to be impactful).

      We thank the reviewer for making this point, and we agree that the nuance of the wording may not fit our results. We amended the corresponding sentences to avoid using the terms “ribosome code” and “translation tuning” throughout the manuscript.

      2) In Figure 4-supplement 1, there appear to be slightly more 80S and less 60S in the METTL18 knockout, with no change in 40S. This might be normal variability in this cell type, but quantification of the peaks from 2 or more experiments is needed to claim that ribosome biogenesis is unaffected by METTL18 deletion. Likewise, the authors need to quantify the area under the curve for 40S and 60S levels from several replicates and show the mean ± error for Figure 3, supplement 1, because that result is essential to claim that ribosome biogenesis is unaffected.

      Accordingly, we repeated all the sucrose density gradient experiments 3 times, quantified the data, and statistically tested the results. Even in the quantification, we could not find a significant change in either the 40S or 60S levels by METTL18 deletion in HEK293T cells (see Figure 3 — figure supplement 1B and 1C).

      Moreover, as a positive control for delayed 60S biogenesis, we treated cells with siRNA targeting RPL17, which is needed for proper 60S assembly (Wang et al. RNA 2015). Quantification of the sucrose density gradients (SDG) showed a reduction in 60S (see figure below and Figure 3 — figure supplement 1D-F) and polysomes (see Figure 4 — figure supplement 1C and 1D), highlighting the weaker effects of METTL18 depletion on 60S and polysome formation.
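      The reviewer asked for the area under the curve of the 40S and 60S peaks across replicates. The following is a minimal sketch of that quantification, assuming an absorbance trace sampled along the gradient and manually chosen peak boundaries. The boundaries, the use of a 60S/40S ratio, and all function names and toy values are hypothetical. This is not the authors' analysis, and it omits the statistical test.

```python
from statistics import mean, stdev


def trapezoid_area(xs, ys) -> float:
    """Area under a sampled curve by the trapezoidal rule."""
    return sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2 for i in range(len(xs) - 1))


def peak_area(position, absorbance, lo, hi) -> float:
    """Area of the absorbance trace between gradient positions lo and hi (inclusive)."""
    pts = [(p, a) for p, a in zip(position, absorbance) if lo <= p <= hi]
    xs, ys = zip(*pts)
    return trapezoid_area(xs, ys)


def ratio_summary(ratios) -> tuple[float, float]:
    """Mean and sample standard deviation across replicate gradients."""
    return mean(ratios), stdev(ratios)


# Toy trace: a "40S" peak between positions 0-2 and a "60S" peak between 2-4.
pos = [0, 1, 2, 3, 4]
a260 = [0.0, 1.0, 0.0, 2.0, 0.0]
ratio = peak_area(pos, a260, 2, 4) / peak_area(pos, a260, 0, 2)
print(ratio)
print(ratio_summary([2.0, 2.2, 1.8]))  # hypothetical ratios from three replicates
```

      Using a ratio within each gradient cancels differences in total loading between runs. The per-replicate ratios can then be compared between conditions with a statistical test of choice.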

      3) The effect of methylation could be any step after accommodation of tRNA in the A site and before dissociation of EF-2, including peptidyl transfer. More evidence is needed for claiming strongly that methylation slows translocation specifically. This could be followed up in vitro in a new study.

      We truly thank the reviewer for the suggestion. Accordingly, we set up experiments combined with hybrid in vitro translation (Panthu et al. Biochem J 2015 and Erales et al. PNAS 2017) and the Renilla–firefly luciferase fusion reporter system (Kisly et al. NAR 2021) (see Figure 5A).

      To test the impact of RPL3 methylation on translation directly, we purified ribosomes from METTL18 KO cells or naïve HEK293T cells supplemented with ribosome-depleted rabbit reticulocyte lysate (RRL) and then conducted an in vitro translation assay (i.e., hybrid translation, Panthu et al. Biochem J 2015 and Erales et al. PNAS 2017) (see figure above and Figure 5A). Indeed, we observed that removal of the ribosomes from RRL decreased protein synthesis in vitro and that the addition of ribosomes from HEK293T cells efficiently recovered the activity (see Figure 5 — figure supplement 1A).

      To test the effect on Tyr codon elongation, we harnessed the fusion of Renilla and firefly luciferases; this system allows us to detect the delay/promotion of downstream firefly luciferase synthesis compared to upstream Renilla luciferase and thus to focus on elongation affected by the sequence inserted between the two luciferases (Kisly et al. NAR 2021) (see figure above and Figure 5A). For better detection of the effects on Tyr codons, we used the repeat of the codon (×39, the number was due to cloning constraints in our hands). We note that the insertion of Tyr codon repeats reduced the elongation rate (or processivity), as we observed a reduced slope of downstream Fluc synthesis (see Figure 5 — figure supplement 1B).

      Using this setup, we observed that, compared to ribosomes from naïve cells, RPL3 methylation-deficient ribosomes led to faster elongation at Tyr repeats (see Figure 5B). These data, which are directly reflected by the ribosomes possessing unmethylated RPL3, provided solid evidence of a link between RPL3 methylation and translation elongation at Tyr codons.

    1. Author Response

      Reviewer #1 (Public Review):

      Adefuin and colleagues examined the interaction between components of binary odor mixtures in odor responses in mice. The authors used two-photon calcium imaging from the soma and apical dendrites of mitral/tufted cells in the olfactory bulb. Odor responses were measured in various conditions: under anesthesia (ketamine/xylazine), while well-trained mice were engaged in an odor discrimination task, or while they were disengaged. The authors first show that mixture components interacted sublinearly in a large fraction of mitral/tufted cells (46%; Fig. 6D), consistent with previous studies. However, when odor responses were measured in awake animals, very few mitral/tufted cells showed sublinear responses at the soma (8-9%; Fig. 6D). Interestingly, sublinear interaction was evident in apical dendrites of mitral/tufted cells (45%). Whether mixture components are represented linearly or not in the olfactory system is an important question, related to the animal's ability to identify or segment mixture components. Somewhat contrary to previous studies, this study demonstrates largely linear interactions. Furthermore, this study compares various behavioral conditions. These results are important and of interest to those who study sensory systems. I have a few concerns regarding data analysis.

      Thank you for your helpful review, and for recognising the relevance of our work. We hope that the reviewer finds our point-by-point responses satisfactory.

      1) Non-linear interactions are detected by the activity showing a deviation from linearity greater than 2 standard deviations. Using this criterion, fewer non-linear interactions might be detected if the trial-by-trial activity becomes more variable. This is concerning because the activity might be less variable in the anesthetized condition, and the reduction in sublinear interactions in awake conditions may be due to a general increase in response variability in the awake state. Can the authors exclude the possibility that the decrease in sublinear interactions is merely due to an increase in response variability in the awake conditions? This issue also applies to the comparison between apical dendrites and somata: are the signals in apical dendrites less variable (maybe due to some averaging across dendrites from multiple cells; see the following point 5)?
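      The reviewer's concern can be illustrated with a toy version of the 2-SD criterion. This sketch assumes that the threshold is twice the trial-to-trial SD of the mixture response. That assumption is ours, and the manuscript's exact definition may differ. The function name and trial values are hypothetical.

```python
import numpy as np


def classify_interaction(mixture_trials, linear_sum: float, n_sd: float = 2.0) -> str:
    """Classify a mixture response relative to the linear sum of component responses.

    Assumption: the threshold is n_sd times the trial-to-trial SD of the mixture
    response; the exact SD used in the manuscript may differ.
    """
    trials = np.asarray(mixture_trials, dtype=float)
    deviation = trials.mean() - linear_sum
    threshold = n_sd * trials.std(ddof=1)
    if deviation < -threshold:
        return "sublinear"
    if deviation > threshold:
        return "supralinear"
    return "linear"


# Same mean response (0.6) and linear sum (1.0), different trial-to-trial variability.
low_variability = [0.60, 0.62, 0.58, 0.60]
high_variability = [0.0, 1.2, 0.2, 1.0]
print(classify_interaction(low_variability, 1.0))   # deviation exceeds 2 SD
print(classify_interaction(high_variability, 1.0))  # same deviation, but within 2 SD
```

      The two toy cells have identical mean sublinearity but receive different labels. This shows how a variance-normalised criterion can confound sublinearity with response variability.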

      Thank you for raising this valid point and for suggesting alternative analyses. We agree that the index we used previously is susceptible to noise and is not appropriate for comparing two datasets with different trial-by-trial variability. To quantify the deviation from the linear sum more robustly, we now use the “median fractional deviation”, which expresses the deviation from the linear sum as a fraction of the predicted linear sum (not normalised by the standard deviation); we then take the median of this distribution for each field of view. As we describe in the revised Figure 4, this measure is more robust to noise. Notably, our finding that mixture summation is generally less sublinear in awake mice still stands for the early phase of responses.
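      A minimal sketch contrasting the two criteria, assuming the fractional deviation is defined as (observed mixture response - linear sum) / linear sum and that the earlier index flagged a response when the mean mixture response lay more than 2 trial-by-trial standard deviations from the linear sum. The function names and toy numbers are hypothetical. With the same mean deviation, the SD-based flag disappears once variability grows, while the fractional deviation is unchanged, which is the reviewer's concern in point 1.

```python
from statistics import mean, stdev


def fractional_deviation(mix, resp_a, resp_b):
    """Deviation of the mixture response from the linear sum of the component
    responses, expressed as a fraction of that sum (assumed definition)."""
    linear_sum = resp_a + resp_b
    return (mix - linear_sum) / linear_sum


def flagged_by_sd(mix_trials, linear_sum, n_sd=2.0):
    """Assumed form of the earlier criterion: flag a response as nonlinear when
    the trial-averaged mixture response lies more than n_sd trial-by-trial
    standard deviations away from the linear sum."""
    return abs(mean(mix_trials) - linear_sum) > n_sd * stdev(mix_trials)


# Two hypothetical cells with identical mean responses (components 4 and 4,
# mean mixture response 6) but different trial-by-trial variability.
low_noise = [5.9, 6.0, 6.1]
high_noise = [3.0, 6.0, 9.0]

for trials in (low_noise, high_noise):
    print(flagged_by_sd(trials, linear_sum=8.0),
          fractional_deviation(mean(trials), 4.0, 4.0))
```

      Because the fractional deviation is not divided by the trial-to-trial SD, it does not shrink simply because responses become noisier, which is why it suits comparisons across conditions with different variability.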

      In the revised manuscript, we use the median fractional deviation whenever we compare the linearity of summation across conditions, including anaesthetised vs. awake, behaving conditions (revised Fig. 4), dendrites vs. somata (revised Fig. 4-figure supplement 1), and different awake states (revised Fig. 6). This has also given us more confidence in our interpretation, so we are grateful for the reviewer’s suggestions.

      2) Related to the above issue, it would be useful to analyze the difference between conditions using different metrics to fully understand what really differs between conditions. The scatter plots shown in various figures do not show drastic differences between awake and anesthetized conditions, as might be suggested by the percentage of sublinear responses. It would be useful to characterize the magnitude of sublinear/supralinear effects. For example, one can calculate a fractional change in the mean response. Does this measure show a consistent difference between awake and anesthetized conditions?

      Thank you for suggesting this analysis. As described above, we now use the fractional deviation to quantify how mixture summation differs from the linear sum, which turned out to be a very useful way to express the property of summation. (N.B.: noise is amplified for small responses when the fractional deviation is used, which is another reason we now use the median.)
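      The point about small responses can be illustrated with a short sketch, again assuming the fractional deviation is (mixture - linear sum) / linear sum. The response values below are hypothetical. One cell with tiny component responses produces a very large fractional deviation that dominates the mean, while the median of the field of view stays close to the typical cell.

```python
from statistics import mean, median


def fractional_deviation(mix, resp_a, resp_b):
    """(mixture - linear sum) / linear sum; assumed definition."""
    linear_sum = resp_a + resp_b
    return (mix - linear_sum) / linear_sum


# Hypothetical (mixture, odour A, odour B) mean responses for four cells in one
# field of view; the last cell has tiny component responses, so a small
# absolute error in its mixture response becomes a large fractional deviation.
cells = [(6.0, 4.0, 4.0), (9.0, 5.0, 5.0), (4.0, 2.0, 2.0), (1.0, 0.05, 0.05)]
devs = [fractional_deviation(m, a, b) for m, a, b in cells]

print(mean(devs))    # pulled far upward by the small-response cell
print(median(devs))  # stays near the deviations of the typical cells
```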

      Reviewer #2 (Public Review):

      This study addresses how complex stimuli are represented in neural responses. This is particularly relevant to olfaction because the vast majority of stimuli are complex mixtures that, perceptually, are not easy to decompose into parts. Nonetheless, the ability to discern a relevant odor from background odors is essential. This process is easier when neural responses to mixtures reflect the linear sum of the responses to the individual components. The main conclusion of this study is that the linearity of olfactory bulb responses to two-component mixtures increases in awake versus anesthetized states. The authors provide some evidence to support this claim. However, this could be better quantified, and there is a temporal aspect of linearization that is not addressed. Perhaps the most interesting aspect of the study is the difference in linearity between the dendrites and the somata of the mitral/tufted cells, but a statistical analysis of this finding was not evident. Overall, a mechanistic or functional approach to understanding these findings is lacking. The differences in linearity between the anesthetized and awake states are simply explained by response saturation in anesthetized animals. There are hints at a mechanism by which linearity is supported in the OB, from comparisons between soma and dendrite, but these are not well developed. There is a model that addresses the functional significance of linearity, but this is only supplemental and not well described.

      Thank you for appreciating the significance of our work, and for your constructive comments.

      Reviewer #3 (Public Review):

      Adefuin et al. use multiphoton imaging of M/T cell responses to investigate whether neuronal representations of binary mixtures can be explained as a sum of the components. The current view in the field (built largely from studies in anesthetized animals) is that mixture summation is non-linear and increases with the degree of glomerular response overlap elicited by the components. The authors reproduce these results and ask whether the same phenomenon is observed in the awake state, in particular when the animals are engaged in an odor discrimination task. Unlike in the anesthetized state, the authors find that mixture representations are linear in the awake brain. They use a series of systematic behavioral paradigms to show that the observed linearity in the awake state (compared to anesthetized) is not dependent on task engagement (reward is given randomly, post-odor) or stimulus relevance (reward is given before odor). While the experiments are well done and the data are presented clearly, I have several major concerns about the interpretation of their results.

      1) Given the data the authors present, it is unclear if one can conclude that the olfactory system is more or less linear in the awake state compared to the anaesthetised one. What seems to change most across the awake vs. anesthetized state is the response amplitude. Responses appear to be ~3x smaller in the awake mice. In the anesthetized state, non-linearity seems most apparent for large response amplitudes (>5 dF/F) with mixture responses being sub-linear, most likely due to saturation effects. The authors themselves do an analysis in Figure 6 - supplement 1 to show that most of the observed non-linearity in the anesthetized animals can be explained away after accounting for amplitude normalisation. The authors use this analysis to comment that the level of linearity is the same across all the three awake states, but the same figure shows that it is in fact the same even for the anaesthetized state.

      To put it differently, the authors' data indeed show that the OB response gain is significantly lower in the awake state, but it is unclear whether summation is more linear when measured at similar response amplitude regimes in both awake and anaesthetised mice.

      Thank you for the valuable comments. We agree that many differences between the anaesthetised vs. awake states should have been taken into account when comparing the linearity of summation. We address the reviewer’s concern now by expressing the deviation as a fraction of the predicted, linear sum of component responses. Further, we also considered another factor that could influence the anaesthetised vs. awake comparison, namely, the trial-by-trial variability. This is reproduced below.

      Figure R1: comparison of mixture summation for the early phase of responses, expressed as the fractional deviation.

      2) The authors argue that keeping response amplitudes small in the awake brain prevents sub-linear summation and therefore may lead to better mixture decomposition. They do a decoding analysis in anaesthetised mice to show that linear mixture representations (instead of the observed sub-linear representations) make odor classification easier. However, I find this analysis uninformative and misleading. It is no surprise that decoders trained on single odor representations should perform better (or equivalently) when using linear sums as input instead of observed sub-linear representations. The authors use this observation to suggest that this mechanism aids discrimination ability in the awake state. However, given that even the single odor responses are much weaker and noisier in the awake state, it is likely that even single odor discrimination ability is poorer in the awake state. By the same logic, mixture decomposition might also be much poorer in the awake brain than in the anesthetized brain, even though summation is more linear, simply because responses are weaker and noisier. In my opinion, the authors should compare decoding accuracy across awake vs. anesthetized responses if they want to assert that linearisation of responses in the awake brain leads to easier decomposition. Otherwise, while linearisation can in principle aid decomposition, in the form that the authors observe here it may come at a high cost in signal-to-noise ratio, which would undo the gain that linearity provides, in principle, for discrimination.

      Thank you very much for the insight and for the excellent suggestion to consider the discriminability of stimuli. In particular, we now include an analysis in which a decoder trained on single-odour responses is tested on observed mixture responses. Surprisingly, despite the substantial differences in response amplitude and trial-by-trial variability, decoders using data from awake mice performed well, even better than those using anaesthetised data for the late phase of responses. This is now described in the revised figures (revised Fig. 5).

      Interestingly, though, the time course of the decoder performance does not correlate well with the linearity of summation. This observation is now described in the abstract (lines 19-21): “…decoding analyses indicated that the data from behaving mice was able to encode mixture responses well, though the time course of decoding accuracy did not correlate with the linearity of summation”.

      3) At a more philosophical level, it is unclear to this Reviewer whether the difference in responses between anesthesia and the awake state should constitute the main focus of the manuscript. The authors explore summation properties under four different brain states, one of which is anaesthesia (also the least behaviorally relevant). In three out of four states, they observe that summation is linear. In the fourth (anaesthesia), they observe that summation is sub-linear, but this happens at much larger response amplitude regimes than in the three awake states sampled, presumably due to saturation. To me, it seems that the authors show here that mixture summation in the OB is largely independent of brain state, since it is unaffected by whether the animal is task-engaged, motivated, etc.

      Thank you for this thoughtful comment. It has made us reflect on the essence of our study. We believe we make three main observations. First, the property of summation differs between the anaesthetised and awake states, and this should be reported because of the large volume of prior work reporting sublinear summation. However, as the reviewer recommends and as described next, this is no longer the sole focus of our study. Second, based on decoder performance, the linearity of summation does not necessarily correlate with the ability to analyse mixtures. We believe it is important to share this observation, since a number of previous studies speculated that nonlinear summation contributes to perceptual difficulty (Bell et al., 1987; Laing, 1994). Third, decoder performance, especially for a decoder trained on single-odour responses and tested on mixtures, differs depending on the awake state, with data from disengaged mice performing particularly poorly. This result is shown in the revised Figure 6. Further, we have edited the abstract and results to ensure that these points are clearly communicated. We hope that the manuscript is now more balanced and better reflects the data.

      4) It is unclear how to interpret the dendritic imaging comparison. First, the dendritic signal is pooled across many cells. If any of the pooled cells shows sub-linearity, the pooled population response will look sub-linear, albeit less so than at the single-cell level. Second, as for the anesthetized vs. awake comparison, there is a discrepancy in response amplitudes: dendritic responses are ~2x stronger than the somatic responses, and sub-linear summation would be more apparent as one approaches the saturation regime. Third, dendritic responses pool both mitral and tufted cells, while the somatic data the authors present are predominantly from tufted cells.

      Thank you for commenting on ways to further understand the dendritic signal. Indeed, the early prevalence of sublinearity in the apical dendrites does seem to relate to the time course of responses. This is treated more directly in the revised Fig.4 – supplement 1.

      To address the averaging effect, we tested what pooled signals might look like in terms of the linearity of summation. To roughly approximate pooled responses, we reasoned that neighbouring TC/MC somata are more likely to belong to the same glomerulus. Thus, we averaged signals from somatic ROIs (TCs and MCs) in each field of view and calculated the fractional deviation from the linear sum (Fig. R2). While a simplistic averaging of neighbouring somata may not be perfectly accurate, this analysis indicates that the difference between the apical dendrites and somata may not be simply explained by the averaging effect.

      Figure R2: Analysis of pooled somatic signals

      To approximate what dendritic signals might look like if they were simple averages of somatic responses, we pooled together signals from all TC/MC somata in each field of view and treated the result as “an approximate glomerular signal”. The plot above shows the fractional deviation from the linear sum. (MC somata data come from an additional set of experiments conducted for this rebuttal.)
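      A minimal sketch of this pooling step, assuming each ROI is summarised by its trial-averaged responses to the mixture and to each component, and that the "approximate glomerular signal" is their unweighted mean; the function names and numbers are hypothetical. It also illustrates the reviewer's point that pooling one sublinear cell with a linear neighbour yields a pooled response that is still sublinear, but less so than the sublinear cell alone.

```python
from statistics import mean


def fractional_deviation(mix, resp_a, resp_b):
    """(mixture - linear sum) / linear sum; assumed definition."""
    linear_sum = resp_a + resp_b
    return (mix - linear_sum) / linear_sum


def pooled_fractional_deviation(cells):
    """Average (mixture, odour A, odour B) responses across all somatic ROIs in a
    field of view, then compute one fractional deviation for the pooled signal,
    as a crude stand-in for a glomerular (dendritic) signal."""
    mix = mean(c[0] for c in cells)
    resp_a = mean(c[1] for c in cells)
    resp_b = mean(c[2] for c in cells)
    return fractional_deviation(mix, resp_a, resp_b)


# Hypothetical field of view: one linearly summing cell, one sublinear cell.
fov = [(4.0, 2.0, 2.0), (6.0, 4.0, 4.0)]
print([fractional_deviation(*c) for c in fov])  # per-cell deviations
print(pooled_fractional_deviation(fov))          # single pooled deviation
```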

      In terms of the unmatched amplitude distributions and trial-by-trial variability across conditions, as the reviewer points out, the issue is similar to the comparison of anaesthetised vs. awake data. To address this, all comparisons are now presented in terms of median fractional deviations. Further, to test whether mitral cells contributed to the discrepancy in linearity between the dendritic and somatic signals, we now provide additional data from 137 MCs (5 fields of view, 3 trained mice performing the mixture task). These changes are described in the revised manuscript (Figure 4-supplement 1).

    1. Author Response

      Reviewer #1 (Public Review):

      Using health insurance claims data (from 8M subjects), a retrospective propensity score matched cohort study was performed (450K in both groups) to quantify associations between bisphosphonate (BP) use and COVID-19-related outcomes (COVID-19 diagnosis, testing and COVID-19 hospitalization). The observation periods were 1-1-2019 to 2-29-2020 for BP use and 3-1-2020 to 6-30-2020 for the COVID endpoints. In primary and sensitivity analyses, BP use was consistently associated with lower odds of COVID-19, testing and COVID-19 hospitalization.

      The major strength of this study is the size of the study population, allowing a propensity-based matched-cohort study with 450K in both groups, with a sizeable number of COVID-19-related endpoints. Health insurance claims data were used, with the intrinsic risk of some misclassification of exposure. In addition, there is probably misclassification of endpoints, as testing for COVID-19 was limited during the study period. Furthermore, the retrospective nature of the study carries the risk of residual confounding, which has been addressed, to some extent, by sensitivity analyses.

      In all analyses there is a consistent finding that BP exposure is associated with reduced odds for COVID-19 related outcomes. The effect size is large, with high precision.

      The authors extensively discuss the (many) potential limitations inherent to the study design and conclude that these findings warrant confirmation, preferably in intervention studies. If confirmed, BP use could be a powerful adjunct in the prevention of infection and hospitalization due to COVID-19.

      We thank the reviewer for this very positive overall feedback. We appreciate the reviewer's comments regarding the potential risks associated with misclassification of exposure and other potential limitations, which we have sought to address in a number of sensitivity analyses and also address in the Discussion of our paper. In addition, as noted by the reviewer, the observed effect size of BP use on COVID-19-related outcomes is large, with high precision, which we feel is a strong argument for exploring this class of drugs in further prospective studies.

      Reviewer #2 (Public Review):

      The authors performed a retrospective cohort study using claims data to assess the causal relationship between bisphosphonate (BP) use and COVID-19 outcomes. They used propensity score matching to adjust for measured confounders. This is an interesting study and the authors performed several sensitivity analyses to assess the robustness of their findings. The authors are properly cautious in the interpretation of their results and justly call for randomized controlled trials to confirm a causal relationship. However, there are some methodological limitations that are not properly addressed yet.

      Strengths of the paper include:

      (A) Availability of a large dataset.

      (B) Using propensity score matching to adjust for confounding.

      (C) Sensitivity analyses to challenge key assumptions (although not all of them add value in my opinion, see specific comments)

      (D) Cautious interpretation of results, the authors are aware of the limitations of the study design.

      Limitations of the paper are:

      (A) This is an observational study using register data. Therefore, the study is prone to residual confounding and information bias. The authors are well aware of that.

      (B) The authors adjusted for the Charlson comorbidity index, whereas they had individual comorbidity data available and a dataset large enough to adjust for each comorbidity separately.

      (C) The primary analysis violates the positivity assumption (a substantial part of the population had no indication for bisphosphonates; see specific comments). I feel that one of the sensitivity analyses 1 or 2 would be more suited for a primary analysis.

      (D) Some of the other sensitivity analyses have underlying assumptions that are not discussed and do not necessarily hold (see specific comments).

      In its current form the limitations hinder a good interpretation of the results and, therefore, in my opinion do not support the conclusion of the paper.

      The finding of a substantial risk reduction of (severe) COVID-19 in bisphosphonate users compared to non-users in this observational study may be of interest to other researchers considering setting up randomized controlled trials to evaluate repurposed drugs for the prevention of (severe) COVID-19.

      We thank the reviewer for the insightful comments and questions related to our manuscript. Our response to the concerns regarding limitations of our study is as follows:

      (A) We agree that there is likely residual confounding and information bias due to the use of US health insurance claims datasets, which do not include information on certain potentially relevant variables. Nonetheless, given the large effect size and precision of our analysis, we feel that our findings support our main conclusion that additional prospective trials appear warranted to further explore whether BPs might confer a measure of protection against severe respiratory infections, including COVID-19. We have added a sentence on the second page of our Discussion (lines 859-860) to emphasize this point: "Specifically, there is the potential that key patient characteristics impacting outcomes could not be derived from claims data."

      (B) The progression of this study mirrors the real-world performance of the analysis where we initially used the CCI in matching to control for comorbidity burden on a broader scale. This was our a priori approach. After observing large effect sizes, we performed more stringent matching for sensitivity analyses 1 and 2. Irrespective of the matching strategy chosen, effect sizes remained similar for all outcome parameters. Therefore, we elected to include both the primary analysis and the sensitivity analyses with more stringent matching in order to more transparently show what was done in entirety during our analyses, as we feel it displays all of the efforts taken to identify sources of unmeasured confounding which could have impacted our results.

      (C) We agree that the positivity assumption is a key factor to consider when building comparable treatment cohorts. We also agree that it is important to perform the analysis separately for all patients with an indication for BP use and for users of other anti-osteoporosis medications, as we have done in our analyses of the Osteo-Dx-Rx cohort and the Bone-Rx cohort, respectively. However, we did not have sufficient data, a priori, to determine whether BP users would be more similar in their risk of COVID-19 outcomes to non-users or to users of other anti-resorptive medications. In addition, we believe that this specific limitation does not negate our findings in the primary analysis, for the following reasons: (1) ‘Type of Outcome’: the outcomes in this study are related to infectious disease and are not direct clinical outcomes of any known treatment benefits of BPs. The clinical benefits being assessed (the impact of BP use on COVID-19-related outcomes) were essentially unknown at the time covered by the study data; this fact mitigates the impact of any violation of the positivity assumption; and (2) ‘Clinical Population’: after propensity score matching, both the BP user and the BP non-user groups in the primary analysis mainly consisted of older females (90.1% female, 97.2% aged >50), which is the main population with clinical indications for BP use. According to NCHS Data Brief No. 93 (April 2012) released by the CDC, ~75% and 95% of US women aged 60-69 and 70-79, respectively, suffer from either low bone mass or osteoporosis, and essentially all women (and 70% of men) above age 80 suffer from these conditions, which often go undiagnosed (https://www.cdc.gov/nchs/data/databriefs/db93.pdf). Women aged 60 and older make up ~75% of our study population (Table 1). Although bone density measurements are not available for non-BP users in the matched primary cohort, there is a high probability that the incidence of osteoporosis and/or low bone mass in these patients was similar to the national average. This justifies the assumption that BP therapy was indicated for most non-BP users in the matched primary cohort. Arguably, for these patients the positivity assumption was not violated.

      (D) We will discuss in detail below the specific issues raised by the reviewer regarding our sensitivity analyses. In general, we acknowledge that individual analytical and/or matching approaches may each have their own limitations, but the analyses performed herein were designed to test systematically the different critical threats to the validity of our initial results in the primary cohort analysis, which were based on a priori-defined methods and yielded a large and robust effect size. Thus, the individual sensitivity analyses should be considered in the broader context of the entire project.

      Specific comments (in order of manuscript):

      Methods:

      Line 158: it is unclear how the authors dealt with patients who died during the follow-up period. The wording suggests they were excluded which would be inappropriate.

      When this study was executed, we were unable to link the patient-level US insurance claims data with patient-level mortality data due to HIPAA concerns. Therefore, line 158 (now 177) defines continuous insurance coverage during the observation period as a verifiable eligibility criterion we used for patient inclusion. It was necessary to disqualify individuals who discontinued insurance coverage for a variety of reasons, e.g. due to loss or change of coverage, relocation etc., but our approach also eliminated patients who died. Appendix 3 (line 2449ff) describes methods we employed post hoc to assess how censoring due to death could have impacted our analyses. We discuss our conclusions from this post hoc analysis in the main text (lines 1053-1058) as follows: "An additional limitation is potential censoring of patients who died during the observation period, resulting in truncated insurance eligibility and exclusion based on the continuous insurance eligibility requirement. However, modelling the impact of censoring by using death rates observed in BP users and non-users in the first six months of 2020 and attributing all deaths as COVID-19-related did not significantly alter the decreased odds of COVID-19 diagnosis in BP users (see Appendix 3)."

      Why did the authors use CCI for propensity matching rather than the individual comorbid conditions? I presume using separate variables will improve the comparability of the cohorts. The authors discuss imbalances in comorbidities as a limitation but should rather have avoided this.

      CCI was the a priori approach defined at the study outset and was chosen due to the widespread use and understanding of this score. The general CCI score was originally planned for matching in order to have the largest possible study population since we did not know how many patients would meet all criteria as well as have an event of interest. After realizing we had adequate sample size to power matching using stricter criteria, we proceeded to perform subsequent sensitivity analyses on more stringently matched cohorts (sensitivity analysis 2).

      Line 301-10: it seems unnecessary to me to adjust for the given covariates when these were already used for propensity score matching (except comorbidities, but see previous comment). The manuscript doesn't give a rationale for why the authors chose this 'double correction'.

      The following language was added to the methods section (lines 325-327): “Demographic characteristics used in the matching procedure were also included in the final outcome regressions to control for the impact of those characteristics on outcomes modelled.”

      The following language was added to the Discussion section regarding the potential limitations of our study (lines 1078-1085): “Another limitation in the current study is related to a potential ‘double correction’ of patient characteristics that were included in both the propensity score matching procedure and the outcome regression modelling, which could lead to overfitting of the regression models and an overestimation of the measured treatment effect. Covariates were included in the regression models since these characteristics could have differential impacts on the outcomes themselves, and our results show that the adjusted ORs were in fact larger (showing a decreased effect size) when compared to the unadjusted ORs, which show the difference in effect sizes of the matched populations alone.”

      In causal research a very important assumption is the 'positivity assumption', which means that none of the individuals has a probability of zero or one to be exposed. Including everyone would therefore not be appropriate. My suggestion is to include either all patients with an indication (based on diagnosis) or all that use an anti-osteoporosis (AOP) drug (or one as the primary and the other as the sensitivity analysis) instead of using these cohorts as sensitivity analyses. The choice should in my opinion be based on two aspects: whether it is likely that other AOP drugs have an effect on the COVID-19 outcomes and whether BP users are deemed to be more similar (in their risk of COVID-19 outcomes) to non-users or to other AOP drug users. Or alternatively, the authors might have discussed the positivity assumption and argue why this is not applicable to their primary analysis.

      The following text has been added to the Discussion section addressing potential limitations of our study (lines 987-1009): "Another potential limitation of this study relates to the positivity assumption, which, when building comparable treatment cohorts, is violated when the comparator population does not have an indication for the exposure being modelled 56. This limitation is present in the primary cohort comparisons between BP users and BP non-users, as well as in the sensitivity analyses involving other preventive medications. This limitation, however, is mitigated by the fact that the outcomes in this study are related to infectious disease and are not direct clinical outcomes of known treatment benefits of BPs. The fact that the clinical benefits being assessed (the impact of BPs on COVID-related outcomes) were essentially unknown clinically at the time of the study data minimizes the impact of violating the positivity assumption. Furthermore, our sensitivity analyses involving the “Bone-Rx” and “Osteo-Dx-Rx” cohorts did not suffer from this potential violation, and the results from those analyses support those from the primary analysis cohort comparisons. Moreover, we note that the propensity score matched BP users and BP non-users in the primary analysis cohort mainly consisted of older females. According to the CDC, ~75% and 95% of US women aged 60-69 and 70-79, respectively, suffer from either low bone mass or osteoporosis (https://www.cdc.gov/nchs/data/databriefs/db93.pdf). Essentially all women (and 70% of men) above age 80 suffer from these conditions, which often go undiagnosed. Women aged 60 and older represent ~75% of our study population (Table 1). Although bone density measurements are not available for non-BP users in the matched primary cohort, there is a high probability that the incidence of osteoporosis and/or low bone mass in these patients was similar to the national average. Thus, BP therapy would have been indicated for most non-BP users in the matched primary cohort, and arguably, for these patients the positivity assumption was not violated."

      Sensitivity Analysis 3: Association of BP-use with Exploratory Negative Control Outcomes: what is the implicit assumption in this analysis? I think the assumption here is that any residual confounding would be of the same magnitude for these outcomes. But that depends on the strength of the association between the confounder and the outcome, which need not be the same. Here, risk-avoiding behavior (social distancing) is the most obvious unmeasured confounder, which may not have a strong effect on other health outcomes. Also, it is unclear to me why acute cholecystitis and acute pancreatitis-related inpatient/emergency-room visits were selected as negative controls. Do the authors have convincing evidence that BPs have no effect on these outcomes? Yet, if the authors believe that this is indeed a valid approach to measuring residual confounding, I think the authors might have taken a step further and presented ORs for BP → COVID-19 outcomes that are corrected for the unmeasured confounding (e.g., if the OR for BP → COVID-19 is ~0.2 and the OR for BP → acute cholecystitis is ~0.5, then the 'corrected' OR for BP → COVID-19 would be ~0.4).
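      The reviewer's example amounts to dividing one odds ratio by the other, OR_corrected = OR(BP → COVID-19) / OR(BP → negative control) = 0.2 / 0.5 = 0.4. This holds only under the implicit assumption that residual confounding shifts both odds ratios by the same multiplicative factor. The sketch below is a simple ratio adjustment under that assumption, not a formal negative-control calibration method, and the function name is hypothetical.

```python
def bias_corrected_or(or_exposure_outcome, or_negative_control):
    """Divide the observed odds ratio by the negative-control odds ratio.
    Assumes residual confounding multiplies both odds ratios by the same factor."""
    if or_exposure_outcome <= 0 or or_negative_control <= 0:
        raise ValueError("odds ratios must be positive")
    return or_exposure_outcome / or_negative_control


# The reviewer's illustrative numbers: OR(BP -> COVID-19) ~ 0.2,
# OR(BP -> acute cholecystitis) ~ 0.5.
print(bias_corrected_or(0.2, 0.5))  # -> 0.4
```

      If the confounder-outcome association differs between the two outcomes, as the reviewer notes for social distancing, the shared-factor assumption fails and this ratio no longer removes the bias.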

      We appreciate the reviewer’s thoughtful comments regarding the differential strength of the association between unmeasured confounders and outcome. We had initially selected acute cholecystitis and pancreatitis-related inpatient and emergency room visits as negative controls because we deemed them to be emergent clinical scenarios that should not be impacted by risk-avoiding behavior. However, upon further searching, we identified several publications that suggest a potential impact of osteoporosis and/or BPs on gallbladder diseases (https://doi.org/10.1186/s12876-014-0192-z; http://dx.doi.org/10.1136/annrheumdis-2017-eular.3900), thus calling the validity of our strategy into question. We therefore agree that the designation of negative control outcomes is problematic and adds relatively little to the overall story, and we have removed these analyses from the revised manuscript.

      Sensitivity Analysis 4: Association of BP-use with Exploratory Positive Control Outcomes: this doesn't help me be convinced of the lack of bias. If previous researchers suffered from residual confounding, the same type of mechanisms apply here. (It might still be valuable to replicate the previous findings, but not as a sensitivity analysis of the current study).

      We agree that the same residual confounding in previous research papers could be present in our study. Nonetheless, it was important to assess whether our analysis would be potentially subject to additional (or different) confounding due to the nature of insurance claims data as compared to the previous electronic record-based studies. Therefore, it was relevant to see if previous findings of an association between BP use and upper respiratory infections are observable in our cohort.

      The second goal of sensitivity analysis #4 (now #3) was to see whether associations could be found on different sets of respiratory infection-based conditions, both during the time of the pandemic/study period as well as during the pre-pandemic time, i.e. before medical care in the US was significantly impacted by the pandemic. In light of these considerations, we feel that sensitivity analysis 4 adds value by showing consistency in our core findings.

      Sensitivity Analysis 5: Association of Other Preventive Drugs with COVID-19-Related Outcomes: The same applies here as for sensitivity analysis 3: the assumption is that the association of unmeasured confounders with other drugs is equally strong as for BPs. Authors should explicitly state the assumptions of the sensitivity analyses and argue why they are reasonable.

      The following sentence was added to the Discussion section (lines 1019-1020): “These analyses were based on the assumption that the association of unmeasured confounders with other drugs is comparable in magnitude and quality as for BPs.”

      Results: The data are clearly presented. The C-statistic / ROC-AUC of the propensity model is missing.

      Unfortunately, a significant amount of time has passed since execution of our original analysis of the Komodo dataset by our co-authors at Cerner Enviza. Our ability to perform follow-up studies with the Komodo dataset (which is exclusively housed on Komodo's secure servers) has since become limited because the business arrangements between these companies have been terminated and the pertinent statistical software is no longer active. This prevents us from obtaining the original C-statistic (ROC-AUC); however, we were able to extract the actual propensity scores for the base cohort matching (BP-users versus non-users). The table below illustrates that propensity scores for the base cohort match ranged from <0.01 to a maximum of 0.49, with 81.4% of patients having a propensity score between 0.10 and 0.49, and 52.9% having a propensity score between 0.20 and 0.49. This distribution does not suggest that propensity scores were concentrated at 0 or 1.
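      For reference, the C-statistic (ROC-AUC) of a propensity model depends only on the per-patient propensity scores and the treatment indicator: it is the probability that a randomly chosen BP-user has a higher score than a randomly chosen non-user, with ties counted as one half. The minimal Python sketch below illustrates that definition, assuming per-patient scores and BP-use flags are available; the function name is hypothetical, and the pairwise loop is written for clarity rather than for claims-scale cohorts, where a rank-based computation would be used.

```python
from typing import Sequence


def c_statistic(scores: Sequence[float], treated: Sequence[bool]) -> float:
    """Concordance (C-statistic / ROC-AUC) of propensity scores for treatment status.

    Probability that a randomly chosen treated patient has a higher score than a
    randomly chosen untreated patient; tied pairs count as one half.
    """
    if len(scores) != len(treated):
        raise ValueError("scores and treated must have the same length")
    pos = [s for s, t in zip(scores, treated) if t]
    neg = [s for s, t in zip(scores, treated) if not t]
    if not pos or not neg:
        raise ValueError("need at least one treated and one untreated patient")
    concordant = 0.0
    for p in pos:          # every treated/untreated pair is compared once
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))
```

      A value near 0.5 means the scores barely separate users from non-users; values near 1.0 indicate strong separation and, typically, limited overlap for matching.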

      Discussion:

      When discussing other studies, the authors reduce their results to 'did' or 'did not find an association'. Although commonly practiced, this does not do justice to the statistical uncertainty of both positive and negative findings. Instead, I encourage the authors to include effect estimates and confidence intervals. This is particularly relevant for studies that are inconclusive (i.e., the lower bound of the confidence interval does not exclude a clinically relevant reduction while the upper bound does not exclude a null effect).

      We appreciate the reviewer’s suggestion and have added this information on p.21/22 in the Discussion.

      Line 1145 "These retrospective findings strongly suggest that BPs should be considered for prophylactic and/or therapeutic use in individuals at risk of SARS-CoV-2 infection." I agree for prophylactic use but do not see how the study results suggest anything for therapeutic use.

      We have removed “and/or therapeutic use” from this sentence (lines 1088-1090).

      The authors should discuss the acceptability of using BPs as preventive treatment (long-term use in persons without osteoporosis or other indication for BPs). This is not my expertise, but I reckon there will be little experience with long-term inhibition of osteoclasts in people with healthy bones. The authors should also discuss what prospective study design would be suitable and what sample size would be needed to demonstrate a reasonable reduction (say, 50%, accounting for some residual confounding being present in the current study).

      Although BPs are also used in pediatric populations and in patients without osteoporosis (for example, patients with malignancy), we do recognize the lack of long-term safety data on the use of BPs as preventive treatments. We tried to partially address this concern in our sub-stratified analysis of COVID-19-related outcomes and time of exposure to BP. Reassuringly, we observed that patients newly prescribed alendronic acid in February 2020 also had decreased odds of COVID-19-related outcomes (Figure 3B), suggesting that the duration of BP treatment may not need to be long-term. This was further discussed in the last paragraph of our Discussion, where we state that "BP use at the time of infection may not be necessary for protection against COVID-19. Rather, our results suggest that prophylactic BP therapy may be sufficient to achieve a potentially rapid and sustained immune modulation resulting in profound mitigation of the incidence and/or severity of infections by SARS-CoV-2."

      We agree that a future prospective study on the effect of BPs on COVID-19 related outcomes will require careful consideration of the study design, sample size, statistical power etc. However, we feel that a detailed discussion of these considerations is beyond the scope of the present study.

      The authors should discuss the fact that confounders were based on registry data which is prone to misclassification. This can result in residual confounding.

      Some potential sources of misclassification have been discussed on lines 932-948. In addition, the following language was added (lines 970-985): "Additionally, limitations may be present due to misclassification bias of study outcomes due to the specific procedure/diagnostic codes used as well as the potential for residual confounding occurring for patient characteristics related to study outcomes that are unable to be operationalized in claims data, which would impact all cohort comparisons. For SARS-CoV-2 testing, procedure codes were limited to those testing for active infection, and therefore observations could be missed if they were captured via antibody testing (CPT 86318, 86328). These codes were excluded a priori due to the focus on the symptomatic COVID-19 population. Furthermore, for the COVID-19 diagnosis and hospitalization outcomes, all events were identified using the ICD-10 code for lab-confirmed COVID-19 (U07.1), and therefore events with an associated diagnosis code for suspected COVID-19 (U07.2) were not included. This was done to have a more stringent algorithm when identifying COVID-19-related events, and any impact of events identified using U07.2 is considered minimal, as previous studies of the early COVID-19 outbreak have found that U07.1 alone has a positive predictive value of 94% (ref. 55), and for this study U07.1 captured 99.2%, 99.0%, and 97.5% of all COVID-19 patient-diagnoses for the primary, “Bone-Rx”, and “Osteo-Dx-Rx” cohorts, respectively."
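      The outcome definition quoted above amounts to a simple rule on claim codes. The sketch below is hypothetical: it assumes each patient's claims are available as a list of ICD-10 code strings, and it defines the U07.1 "capture" share per patient, which may differ from how the manuscript counted patient-diagnoses.

```python
from typing import Dict, Iterable, List

CONFIRMED = "U07.1"  # ICD-10: lab-confirmed COVID-19 (counts as an outcome)
SUSPECTED = "U07.2"  # ICD-10: suspected COVID-19 (excluded from outcomes)


def has_covid_outcome(claim_codes: Iterable[str]) -> bool:
    """A patient counts as a COVID-19 case only if U07.1 appears on a claim."""
    return CONFIRMED in set(claim_codes)


def u071_capture_share(patients: Dict[str, List[str]]) -> float:
    """Share of patients with any COVID-19 code (U07.1 or U07.2) who have U07.1.

    Hypothetical per-patient denominator, used here only for illustration.
    """
    any_covid = [codes for codes in patients.values()
                 if CONFIRMED in codes or SUSPECTED in codes]
    if not any_covid:
        return 0.0
    captured = sum(1 for codes in any_covid if CONFIRMED in codes)
    return captured / len(any_covid)
```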

    1. Author Response:

      Reviewer #1:

      In this paper, the authors did a fine job of combining phylogenetics and molecular methods to demonstrate parallel evolution across vRNA segments in two seasonal influenza A virus subtypes. They first estimated phylogenetic relationships between vRNA segments using Robinson-Foulds distance and identified the possibility that parallel evolution of RNA-RNA interactions drives genomic assembly. This is indeed an interesting mechanism, in addition to the traditional role of proteins in this process. Subsequently, they used molecular biology to validate such RNA-RNA-driven interaction by demonstrating co-localization of vRNA segments in infected cells. They also showed that the parallel evolution between vRNA segments might vary across subtypes and virus lineages isolated from distinct host origins. Overall, I find this to be excellent work with major implications for the genome evolution of infectious viruses and the emergence of new strains with altered genome combinations.

      Comments:

      I am wondering if leaving out sequences (those that do not resolve well) in the phylogenetic analysis distorts the true picture of the proposed associations. What if they reflect evolutionary intermediates with important implications for pathogen evolution, which are lost in the analyses?

      We fully appreciate this concern and have explored this extensively. One principal assumption underlying the approach we outline in this manuscript is that the trees analyzed are robust and well-resolved. We use tree similarity as a correlate for relationships between genomic segments, so the trees must be robust enough to support our claims, as we have clarified in lines 128-131. We initially set out to examine a broader range of viral isolates in each set of trees, but larger trees containing more isolates consistently failed to be supported by bootstrapping. Bootstrapping is by far the most widely used methodology for demonstrating support for tree nodes. For comparison, we provide the closest possible example to the trees presented in this manuscript. We took all 84 H3N2 strains from 2005-2014 analyzed in replicate trees 1-7 and collapsed these sequences into one tree for each vRNA segment. Figure X-A, specifically provided for the reviewers, illustrates the resultant collapsed PB2 tree, with bootstrap values of 70 or higher shown in red and individual strains coded by cluster and replicate. As expected, the majority of internal nodes on such a tree are largely unsupported by bootstrapping, indicating that relaxing our constraint of 97% sequence identity increases the uncertainty in our trees.

      Because we agree with Reviewers #1 and #3 on the critical importance of validating our approach, we determined the distances between these new collapsed trees using a complementary approach, Clustering Information Distances (CID), that is independent of tree size (Supplemental Figure 4B and Figure X-B & X-C). Larger trees containing all sequences yielded pairwise vRNA relationships that are largely similar to those we report in the manuscript (R2 = 0.6408; P = 3.1E-07; Figure X-B vs. X-C), including higher tree similarity between PB2 and NA over NS. This observation strengthens the rationale to focus on these segments for molecular validation and correlate parallel evolution to intracellular localization in our manuscript (Figure 7). However, tree distances are generally higher in Figure X-C than in Figure X-B, which we might expect if poorly supported nodes in larger trees artificially inflate phylogenetic signal. Given the overall similarity between Figures X-B and X-C, both methods yield largely comparable results. We ultimately relied upon the more robust replicate trees with stronger bootstrap support.
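      The R² reported above compares pairwise segment distances from two analyses. As a minimal sketch, assuming each point is one pair of vRNA segments (8 segments give 28 pairs) with its distance from one set of trees on the x-axis and from the other on the y-axis, R² for a simple linear regression can be computed as below; the function name is illustrative, and the P value would require a separate test.

```python
from typing import Sequence


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Coefficient of determination of an ordinary least-squares fit of y on x.

    For simple linear regression with an intercept this equals the squared
    Pearson correlation between x and y.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("R^2 is undefined for constant input")
    return sxy * sxy / (sxx * syy)
```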

      Lines 50-51: Can you please elaborate? I think this might be useful for the reader to better understand the context. Also, a brief description on functional association between different known fragments might instigate curiosity among the readers from the very beginning. At present, it largely caters to people already familiar with the biology of influenza virus.

      We have added additional information to reflect the complexity of intersegmental interactions and the current standing of the field (lines 49-52).

      Lines 95-96: Were these strains all swine-origin? More details on these lineages will be useful for the readers.

      We have clarified that all strains analyzed were isolated from humans, but were of different lineages (lines 115-120).

      Lines 128-132: I think it would be nice to introduce these hypotheses well in advance, maybe in the Introduction, with more functional details of viral segments.

      We incorporated our hypotheses regarding tree similarity into the existing discussion of epistasis in the Introduction (lines 74-75 and 89-106).

      Lines 134-136: Please rephrase this sentence to make it more direct and explain the why. E.g. "... parallel evolution between PB1 and HA is likely to be weaker than that of PB1 and PA".

      The text has been modified (lines 165-168).

      Lines 222-223: Please include a set of hypotheses to explain your results. Please add a perspective in the Discussion on how this might contribute to the pandemic potential of H1N1.

      We have added in our interpretation of the results (lines 259-264) and expanded upon this in the Discussion (lines 418-422).

      Lines 287-288: I am wondering how likely is this to be true for H1N1.

      We have expanded on this in the Discussion (lines 409-410).

      Reviewer #2:

      The influenza A genome is made up of eight viral RNAs. Despite being segmented, many of these RNAs are known to evolve in parallel, presumably due to similar selection pressures, and influence each other's evolution. Viral protein-protein interactions have been found to be the mechanism driving the genomic evolution. Employing a range of phylogenetic and molecular methods, Jones et al. investigated the evolution of the seasonal influenza A virus genomic segments. They found that the evolutionary relationships between different RNAs varied between two subtypes, namely H1N1 and H3N2. The evolutionary relationships in the case of H1N1 were also temporally more diverse than in H3N2. They also reported molecular evidence indicating that RNA-RNA interactions drive the genomic coevolution, in addition to the protein interactions. These results not only provide additional support for the presence of parallel evolution and genetic interactions in the influenza A genome but also advance the current knowledge of the field by providing novel evidence in support of RNA-RNA interactions as a driver of genomic evolution. This work is an excellent example of hypothesis-driven scientific investigation.

      The communication of the science could be improved, particularly for viral evolutionary biologists who study emergent evolutionary patterns but do not specialise in the underlying molecular mechanisms. The improvement can be easily achieved by explaining jargon (e.g., deconvolution) and methodological logics that are not immediately clear to a non-specialist.

      We have clarified or eliminated jargon wherever possible throughout the text.

      The introduction section could be better structured. The crux of this study is the parallel molecular evolution in influenza genome segments and interactions (epistasis). The authors spent the majority of the introduction section leading to those two topics and then treated them summarily. This structure, in my opinion, is diluting the story. Instead, introducing the two topics in detail at the beginning (right after introducing the system) then discussing their links to reassortments, viral emergence etc. could be a more informative, easily understandable and focused structure. The authors also failed to clearly state all the hypotheses and predictions (e.g., regarding intracellular colocalisation) near the end of the introduction.

      We restructured the Introduction with more background on genomic assembly in influenza viruses, as requested by two reviewers (lines 43-52), more discussion of epistasis (lines 58-63) and provided a more thorough discussion of all hypotheses (lines 74-77, 88-92, 94-95, 97-106).

      The authors used the Robinson-Foulds (RF) metric to quantify topological distance between phylogenetic trees, a key variable of the study. But they did not justify using the metric despite its well-known drawbacks, including lack of biological rationale and lack of robustness, particularly when more robust measures, such as generalised RF, are available.

      We agree that RF has drawbacks. To address this, we performed a companion analysis using the Clustering Information Distance (CID) recently described by Smith, 2020. The mean CID can be found in Figure S4, the standard error of the mean in Figure S5, and networks depicting overall relationships between segments by CID in Figure S7E-S7H. To better assess how well RF and CID correlate with each other across influenza virus subtypes and lineages, we reanalyzed all data from both sets of distance measures by linear regression (Figure 3B, 4B-C, 5B, S6 and S9). Our results from both methods are highly comparable, which we believe strengthens our conclusions. Both analyses are included in the resubmission (lines 86-89; 162; 164; 187-188; 199-200; 207-208; 231-234; 242-244; 466-470).
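      For readers less familiar with the metric under discussion, the Robinson-Foulds distance counts the bipartitions (splits of the leaf set induced by internal branches) present in one tree but not the other. The following is a minimal Python sketch, assuming trees are given as nested tuples of strain names on a shared leaf set; it is not the pipeline used in the manuscript, and it does not implement the Clustering Information Distance, which weights splits by their information content (Smith's TreeDist R package implements both).

```python
from typing import FrozenSet, Set, Tuple, Union

Tree = Union[str, Tuple["Tree", ...]]  # a leaf name or a tuple of subtrees


def leaves(tree: Tree) -> FrozenSet[str]:
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree: Tree) -> Set[FrozenSet[str]]:
    """Non-trivial bipartitions of an (effectively unrooted) tree."""
    all_leaves = leaves(tree)
    ref = min(all_leaves)  # fixed reference leaf used to canonicalize each split
    result: Set[FrozenSet[str]] = set()

    def visit(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        # Store the side of the split that does not contain the reference leaf,
        # so the same bipartition is recorded identically regardless of rooting.
        side = all_leaves - clade if ref in clade else clade
        if 1 < len(side) < len(all_leaves) - 1:  # skip trivial splits
            result.add(side)
        return clade

    visit(tree)
    return result


def robinson_foulds(t1: Tree, t2: Tree) -> int:
    if leaves(t1) != leaves(t2):
        raise ValueError("trees must share the same leaf set")
    return len(splits(t1) ^ splits(t2))  # splits found in exactly one tree
```

      Dividing the result by 2(n − 3), the maximum for two unrooted binary trees with n leaves, gives a normalized value between 0 and 1.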

      Figure 1 of the paper is extremely helpful to understand the large number of methods and links between them. But it could be more useful if the authors could clearly state the goal of each step and also included the molecular methods in it. That would have connected all the hypotheses in the introduction to all the results neatly. I found a good example of such a schematic in a paper that the authors have cited (Fig. 1 of Escalera-Zamudio et al. 2020, Nature communications). Also this methodological scheme needs to be cited in the methods section.

      We provided the molecular methods in a schematic in Figure 1D and the figure is cited in the Methods (lines 310; 440; 442; 456; 501).

      Finally, I found the methods section difficult to navigate, not because it lacked detail; the authors have been excellent in providing a considerable amount of methodological detail. The difficulty arose from the lack of a chronological structure. Ideally, the methods should be grouped under research aims (for example, data mining and subsampling, analysis of phylogenetic concordance between genomic segments, identifying RNA-RNA interactions, etc.), which would clearly link methods to specific results on the one hand and to the hypotheses on the other. This structure would make the article more accessible, to a general audience in particular. The results section appeared to achieve this goal and thus often repeated or explained methodological detail, which ideally should have been restricted to the methods section.

      We organized the Methods section by research aims as suggested. However, some discussion of the methods was retained in the Results section to ensure that the manuscript is accessible to audiences without formal training in phylogenetics.

      Reviewer #3:

      The authors sought to show how the segments of influenza viruses co-evolve in different lineages. They use phylogenetic analysis of a subset of the complete genomes of H3N2 or the two H1N1 lineages (pre and post 2009), and use a method - Robinson-Foulds distance analysis - to determine the relationships between the evolutionary patterns of each segment, and find some that are non-random.

      1) The phylogenetic analysis used leaves out sequences that do not resolve well in the phylogenetic analysis, with the goal of achieving higher bootstrap values. It is difficult to understand how that gives the most accurate picture of the associations - those sequences represent real evolutionary intermediates, and their inclusion should not alter the relationships between the more distantly related sequences. It seems that this creates an incomplete picture that artificially emphasizes differences among the clades for each segment analyzed?

      Reviewer #1 raised the same concern. Please refer to our response at the beginning of this letter where we address this issue in depth.

      2) It is not clear what the significance is of finding sequences that share branching patterns in the phylogeny, or how that informs our understanding of the likelihood of genetic segments having some functional connection. What mechanism is being suggested? Is this a proxy for the gene segments having been present in the same viruses, thereby revealing the favored gene segment combinations? Is there some association suggested between the RNA sequences of the different segments? The frequently evoked HA:NA associations may not be a directly relevant model, as those are thought to relate to the balance of sialic acid binding and cleavage associated with mutations focused around the receptor binding site and active site, length of the NA stalk, and the HA stalk; does that show up in the overall phylogeny of the HA and NA segments? Is there co-evolution of the polymerase gene segments, or has that been revealed in previous studies, as is suggested?

      We clarified our working hypotheses in the Introduction (lines 89-106) and what is known about the polymerase subunits (lines 92-93). Our data do suggest that polymerase subunits share similar evolutionary trajectories that are more driven by protein than RNA (lines 291-293; Figure 2A and 6). The point about epistasis between HA and NA arising from indirect interactions is entirely fair, but these studies are nonetheless the basis for our own work. We have clarified the distinction between these prior studies and our own in the text (lines 60-63 and 74-75). Moreover, our protein trees built from HA and NA recapitulate what has been shown previously, which we highlight in the text (lines 293-296; Figure 6 and Figure S10). We also clarified our interpretation of tree similarity throughout the text (lines 165-168; 190-191; 261-264; 323-326; 419-423).

      The mechanisms underlying the genomic segment associations described here are not clear. By definition, they would be related to the evolution of the entire RNA segment sequence, since that is what is being analyzed. Is this (1) because of a shared function (seems unlikely, but perhaps pointing to a new activity), (2) because of some RNA sequence-associated function (inter-segment hybridization, common association of RNA with some cellular or viral protein), or (3) related to specific functions in RNA packaging? Please tell us whether the current RNA packaging models inform about a possible process. Is there a known packaging assembly process based on RNA sequences, where the association leads to co-transport and packaging? In that case, the co-evolution should be more strongly seen in the region involved in that function and not elsewhere. The apparent increased association of the subset of genes examined for the single virus is seen mainly in the cytoplasm close to the nucleus; does this suggest function (2) and/or (3)?

      It is difficult to figure out how the data found correlates with the known data on reassortment efficiency or mechanisms of systems for RNA segment selection for packaging or transport - if that is not obvious, maybe you can suggest processes that might be involved.

      We provided more context on genomic packaging in the Introduction, including the current model in which direct RNA interactions are thought to drive genomic assembly (lines 43-53). Although genomic segments are bound by viral nucleoprotein (NP), accurate genomic assembly is theorized to be a result of intersegment hybridization rather than driven by viral or cellular protein. We further clarified our hypotheses regarding the colocalization data in the Results section to make the proposed mechanism clearer (lines 313-326).

    1. Author Response:

      Evaluation Summary:

      The study provides evidence that specific transcriptional responses may underpin the observation that metabolic rates often scale inversely with body mass. The conclusions are supported by direct measurement of metabolic fluxes in mouse and rat livers, although generalizations to other settings remain to be rigorously tested. The study has broad implications for researching and studying animal metabolism and physiology.

      We thank the reviewers and editors for this summary. We are pleased that they agree that the conclusions “are supported by direct measurements of metabolic fluxes in mouse and rat livers,” and that “the study has broad implications for researching and studying animal metabolism and physiology.” While we fully agree that “generalizations to other settings remain to be rigorously tested,” we have now added a comment comparing our measured liver fluxes in rodents to those recently measured in people:

      “While we did not have the capacity to measure liver fluxes in larger mammals in the current study, endogenous glucose production, VPC, and VCS previously measured using PINTA were 50-60% lower in overnight fasted humans than in rats (Petersen et al., 2019), assuming a liver size of 1,500 g in humans.”

      Reviewer #1 (Public Review):

      It is well established that the mass-specific energy expenditure and metabolic rate of metazoan organisms scale inversely with body mass, based on the measurement of oxygen consumption and caloric intake. However, the underlying regulatory mechanisms for this observation are poorly defined. To investigate whether metabolic scaling is associated with reduced levels of transcription of metabolic genes in larger animals, the authors reviewed existing transcriptional datasets from liver tissues of five animals (mice, rats, monkeys, humans and cattle) with a 30,000-fold range in average adult body weights. They identified a number of metabolic genes in different pathways of central carbon metabolism whose expression inversely scaled with body size, a majority of which required oxygen, NAD/H or ATP/ADP. Metabolic flux studies on intact liver sections, as well as in live animals, also revealed decreased liver metabolic fluxes in rats compared to mice. Interestingly, these differences were not observed in primary hepatocyte cultures, indicating that metabolic scaling is primarily regulated by cell-extrinsic factors and tissue context. These are interesting findings and highlight the importance of measuring metabolic processes in vivo. The measurement of cellular metabolic fluxes in different contexts (cultured, ex vivo tissue sections and live animals) is a major strength of this study. The lack of direct evidence that enzyme levels correlate with mRNA, and the absence of both transcriptional and enzyme activity measurements in cultured cells, are potential weaknesses.
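      One simple way to flag genes whose expression "inversely scales" with body size is to fit a power law, i.e. a straight line in log-log space, and check the sign of the slope. The sketch below is an assumption about how such a screen could look, not the authors' published criterion; it requires strictly positive normalized expression values, and the body masses in the usage check are hypothetical.

```python
from math import log
from typing import Sequence


def log_log_slope(body_mass: Sequence[float], expression: Sequence[float]) -> float:
    """Least-squares slope of log(expression) on log(body mass).

    A negative slope means expression (per unit tissue) falls as body mass rises;
    a slope of -0.25 would match classic mass-specific metabolic scaling.
    """
    if len(body_mass) != len(expression) or len(body_mass) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    xs = [log(m) for m in body_mass]   # requires positive masses
    ys = [log(e) for e in expression]  # requires positive expression values
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("body masses must not all be equal")
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx
```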

      We are delighted, and thank Reviewer #1 for stating that “These are interesting findings and highlight the importance of measuring metabolic processes in vivo” and that “The measurement of cellular metabolic fluxes in different contexts (cultured, ex vivo tissue sections and live animals) is a major strength of this study.” In addition, we sincerely thank the reviewer for raising important weaknesses related to the lack of proteomics data and of transcriptional and enzyme activity measurements in cultured cells, and are pleased to have had the opportunity to add data to address each of these points.

      Reviewer #2 (Public Review):

      Akingbesote et al. aim to determine the molecular basis of metabolic scaling, the phenomenon that metabolic rate scales with body mass to the 0.75 power, so that metabolic rate per unit mass decreases as body mass increases. More specifically, they test the hypothesis that expression of genes involved in the regulation of oxygen consumption and substrate metabolism, as well as the respective fluxes, provide a molecular basis for metabolic scaling across five species: mice, rats, monkeys, humans, and cattle. To this end, Akingbesote et al. use publicly available transcriptomics data and identify genes that show decreasing (normalized) expression with increasing mass of organisms. This descriptive analysis is followed by a discussion of a few relevant examples and (KEGG) pathway enrichment analysis. The authors then used their published PINTA approach with data from their experiments with mice and rats to provide estimates of selected cytosolic and mitochondrial fluxes in vitro, ex vivo, and in vivo; these estimates are then employed in determining if metabolic fluxes scale. The conclusion drawn from these analyses is that estimates of selected fluxes do not differ in vitro between plated hepatocytes of mice and rats, but that differences can be detected using metabolic flux analysis in vivo. As a result, in vivo flux profiling is more relevant to assessing metabolic scaling.
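      The scaling relationship can be made concrete with a worked equation. This assumes the classic 3/4-power exponent (Kleiber's law) holds exactly across species with a shared constant a, which is an idealization; with whole-body metabolic rate B and body mass M, and the 30,000-fold range of adult body weights cited in Reviewer #1's summary:

```latex
B = a\,M^{3/4}
\quad\Longrightarrow\quad
\frac{B}{M} = a\,M^{-1/4},
\qquad
\frac{(B/M)_{\text{largest}}}{(B/M)_{\text{smallest}}}
  = \left(3 \times 10^{4}\right)^{-1/4} \approx 0.076
```

      Under these assumptions, mass-specific metabolic rate would differ roughly 13-fold between the smallest and largest species analyzed.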

      The conclusions are only in part supported by the data and clarifications are needed both with respect to the analysis of transcriptomics data as well as flux estimates:

      1. In looking for scaling in gene expression, the authors rely on the assumption that mRNA expression correlates well with protein abundance (citing Schwanhäusser et al., 2011); however, transcripts explain about 40% of variance in protein abundance (this observation holds across multiple species). Hence, the identified patterns based on the transcript data may have little implications for protein abundance or flux.

      We agree that, despite the data in the cited publication, gene expression should not be assumed to directly correlate with protein expression, and neither can be assumed, without supporting data, to equate to metabolic flux. We have removed the citation and replaced it with proteomics data. Half of the genes available in the proteomics analysis that were found to correlate negatively with body size in our liver transcriptomics analysis also correlated negatively with body size at the level of liver protein expression:

      Author Response Figure 1

      Additionally, we analyzed available proteomics data on left ventricular expression of the three proteins observed to correlate negatively with body mass in the liver proteomics analysis. One of these three, GLUL, also correlated negatively with body mass when its expression was assessed in the heart:

      Author Response Figure 2

      However, as discussed in our response to the editor’s point 1, we are limited by the available data, and fully acknowledge that without the capacity to statistically compare groups, we cannot make conclusive statements regarding the proteomics data.

      Additionally, we have substantially softened the description of the implications of the transcriptomics data in the Abstract, Introduction, and Discussion, including:
      - Edited “Together, these data reveal that metabolic scaling extends beyond oxygen consumption to numerous other metabolic pathways, and is likely regulated at the level of gene and protein expression, enzyme activity, and substrate supply” to add the parameters in red.
      - Removed “Considering that mRNA expression correlates well with protein expression under basal conditions, especially for metabolic genes (Schwanhäusser et al., 2011), we used mRNA expression as a proxy for the relative abundance of metabolic enzymes.”
      - Added “Further analysis of liver proteomics revealed that approximately half of the genes in liver that scaled at the transcriptional level also scaled at the level of protein expression,” now linking gene expression to protein expression to metabolic flux.
      - Edited “Numerous metabolic genes…followed the pattern of metabolic scaling, and informed our isotope tracer based in vitro and in vivo metabolic flux studies” to “Numerous metabolic genes…followed the pattern of metabolic scaling. Further analysis of liver proteomics revealed that approximately half of the genes in liver that scaled at the transcriptional level also scaled at the level of protein expression. To determine if gene and protein expression would correlate with scaling at the level of metabolic flux, we performed a comprehensive assessment of liver metabolism in vivo and in vitro using modified Positional Isotopomer NMR Tracer Analysis (PINTA)…”
      - Edited “Taken together, this study demonstrates systems regulation of metabolic scaling: gene expression in livers showed that scaling occurs to regulate oxygen consumption and substrate supply, isotope-based tracer studies in mice and rats demonstrated the mechanistic function of these enzymes in vivo which was only apparent in the living organism rather than plated cells” to “Taken together, this study demonstrates systems regulation of the ordering of metabolic fluxes according to body size, and provides unique insight into the regulation of metabolic flux across species.”
      - Removed “Interestingly, the scaling of GPT and ADIPOR1 further suggest that there is dependence on extra-hepatic organs in the scaling of in vivo gluconeogenesis and fatty acid oxidation: that is, skeletal muscle supply of alanine for the liver mediated glucose-alanine cycle and adipose tissue-derived adiponectin signaling. These findings also suggests that the scaling of mitochondrial mass (Porter and Brand, 1995) or mitochondrial proton leak (Porter and Brand, 1993) cannot fully explain metabolic scaling.”
      - Added “However, it should be noted that metabolic scaling cannot fully be explained at the transcriptional level, because many rate-limiting enzymes in the metabolic processes measured in vivo did not scale at the transcriptional level, and only approximately half of genes that scaled at the level of mRNA scaled at the level of protein. Thus, it is likely that both transcriptional and other mechanisms – such as enzyme activity – are responsible for variations in metabolic flux per unit mass, inversely proportionally to body size. Additionally, the currently available data do not allow us to assess whether expression of certain isoforms of key metabolic enzymes scale differentially across species.”

      1. While the procedure used to identify transcripts whose expression scales is clearly described, focusing the enrichment on KEGG pathways can only identify metabolic genes that scale. It would be informative and instructive to investigate whether, and to what extent, genes involved in non-metabolic processes that affect metabolic rates also scale.

      We acknowledge that focusing the enrichment on KEGG pathways does enrich for the identification of metabolic processes that scale. However, we would respectfully submit that because this manuscript focuses on metabolic scaling, this seems to be the most appropriate setting in which to conduct the analysis. New data added in this revision demonstrate that three metabolic enzymes that scaled in the transcriptomics analysis also scale relative to β-actin, further suggesting that the inverse correlation of gene expression with body weight is primarily confined to metabolic processes:

      Author Response Figure 3

      In addition, we measured the expression of two structural proteins (collagenase 3 [Mmp3] and Larp6) outside of metabolic pathways, relative to β-actin (Actb), and found that neither was differentially expressed relative to actin in mice versus rats:

      Author Response Figure 4

      We recognize that these data may be confounded by the fact that Actb expression could differ between mice and rats; however, the fact that metabolic genes scale relative to β-actin (Actb) expression indicates that global mRNA scaling is unlikely to be the sole cause of the metabolic scaling phenotype.
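      The logic of normalizing to β-actin can be made explicit with a small sketch (hypothetical expression values in arbitrary units; the function name is illustrative). If every transcript in one species were uniformly higher, the target/reference ratio would cancel that global factor, so a normalized fold change different from 1 points to gene-specific scaling. This holds only if the reference gene itself does not differ between species, which is the caveat acknowledged above.

```python
def normalized_fold_change(target_a, ref_a, target_b, ref_b):
    """Fold change of reference-normalized expression, group A relative to group B."""
    return (target_a / ref_a) / (target_b / ref_b)

# Global scaling only: all transcripts in A are 2x those in B, so the ratio cancels.
global_only = normalized_fold_change(200.0, 1000.0, 100.0, 500.0)

# Gene-specific scaling: the target is 4x higher in A while the reference is only 2x higher.
gene_specific = normalized_fold_change(400.0, 1000.0, 100.0, 500.0)

print(global_only, gene_specific)
```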

      1. The results on flux ratios and absolute fluxes, based on the equations in Table S1, rely on certain assumptions (e.g., metabolic and isotopic steady state, among others listed for PINTA); the current presentation does not ensure that all assumptions of PINTA are met in the present setting, so the estimates may be biased, leading to alternative explanations for the observed differences in vivo or the lack thereof in vitro.

      We fully agree with the reviewer that it is critical to ensure that key assumptions are met when presenting tracer data, and thank them for raising this important point. We have therefore added data demonstrating that plasma m+1, m+2, and m+7 glucose are at steady state at 100 min of the 120 min in vivo tracer infusion:

      Author Response Figure 5

      Additionally, we now show that blood glucose and plasma lactate concentrations have reached steady state as well:

      Author Response Figure 6

      With these data, we confirm that the mice and rats are at metabolic and isotopic steady state by the end of the 120 min tracer infusion. We recognize that we have not validated that liver m+1 and m+2 glucose are at steady state, as that would require two additional groups of mice and rats (to sacrifice at 100 and 110 min, compared to the animals euthanized after 120 min of infusion) and would introduce additional variability. However, plasma m+1 and m+2 glucose arise from endogenous glucose production from the 13C tracer, so if m+1 and m+2 glucose are at steady state in plasma, they must be at steady state in liver.

      An additional assumption is that liver glycogen is effectively depleted after the overnight fast utilized in these studies. We have now verified this assumption by comparing fed and overnight-fasted liver glycogen concentrations, and detected negligible glycogen after the fast in both rats and mice:

      Author Response Figure 7

      Additionally, we validated isotopic steady state in our hepatocytes incubated in [3-13C]lactate. As expected in plated cell studies, cells reached steady state in both [13C]lactate enrichment and m+1 and m+2 glucose enrichment within 60 min. Because net glucose production is measured from the accumulation of glucose, we do not expect (and did not measure) glucose concentration at steady state, but we did confirm that glucose accumulation is linear throughout the 6 hr incubation, thus confirming that 6 hr is a reasonable endpoint:

      Author Response Figure 8

      We very respectfully submit that, after 8 prior publications using PINTA by name (PMID 28986525, 29307489, 29483297, 31545298, 31578240, 32610084, 32132708, 32179679), in addition to several prior publications that utilized PINTA without the acronym, it would not be the most responsible use of animals to try to prove in this manuscript that PINTA is a legitimate means of assessing substrate fluxes. However, we thank the reviewer for raising the important point regarding assumptions of the method, thereby allowing us to insert data verifying that the key assumptions are met.

      1. The findings regarding the flux estimates seem to be fully determined by observed differences in gluconeogenesis (as demonstrated in Fig. 4). Usage of more involved approaches for metabolic flux analysis may provide wider-reaching conclusions beyond selected fluxes that appear fully coupled.

      Fluxes are back-calculated from total glucose production, so methodologically they are “coupled”, but this does not mean that glucose production will always mirror other fluxes. For example, in our 2015 manuscript using PINTA – although we had not yet named the method “PINTA” – we measured decreased endogenous glucose production (EGP) simultaneously with increased citrate synthase flux (mitochondrial oxidation, VTCA, which we have subsequently begun to call VCS in recognition of the fact that different reactions in the TCA cycle can proceed at different rates, but the calculation is the same) (Perry et al. Science 2015).

      Similarly, another study demonstrated that the same mitochondrial uncoupler (CRMP) increased VCS while EGP decreased in nonhuman primates (Goedeke et al. Sci. Transl. Med. 2019).

      These data demonstrate that, while fluxes are back-calculated from EGP with PINTA, the method is fully capable of detecting differences in oxidative fluxes without, or in the opposite direction of, changes in EGP. We very respectfully submit that we are not aware of what a more “involved” approach for metabolic flux analysis would entail, and that after the 8 prior publications listed in response to the previous point, we are not trying to validate PINTA in the current manuscript.

      Reviewer #3 (Public Review):

      This manuscript addresses a fundamental aspect of mammalian biology referred to as scaling, in which metabolic processes calibrate to the size of the organism. Longstanding observations related to scaling have been established based on rates of oxygen consumption. This manuscript extends these observations to gene expression and metabolic fluxes in order to discover the metabolic pathways that scale with body mass. The analyses are focused on the liver, which is the metabolic hub of the organism. Gene expression levels gleaned from available databases for organisms of varied sizes are analyzed and queried for scaling based on body mass. This analysis reveals that scaling is mainly a characteristic of metabolic genes. These data inform metabolic flux studies in cultured cells, liver slices and whole organisms. These studies demonstrate that scaling of metabolic fluxes occurs, but not out of the context of the whole organism or intact liver (in the form of liver slices). Scaling of metabolic fluxes is not observed in cultured hepatocytes. Overall, this is an interesting line of inquiry. The data are largely correlative in nature but add important texture to traditional characterization of oxygen consumption rates. The application of flux studies is a particular strength because these reflect the true metabolic processes. Enthusiasm was tempered by certain claims that extend beyond the data (e.g., the title, which suggests that metabolic scaling applies to tissues other than the liver, which was studied), as well as low numbers of biological replicates in some experiments, studies conducted in a single sex, and a writing style that includes excessive technical jargon.

      We thank the reviewer for their time spent evaluating the paper, and for their very helpful comments. We agree that “the application of flux studies is a particular strength because these reflect the true metabolic processes.” We agree that the study was focused on liver, although the previous iteration did include a small amount of white adipose tissue flux data, and have edited the manuscript to make clear that this is a liver-focused manuscript. We have now added specific numbers to each figure legend, and have also added in vivo flux measurements in female rats and mice. Additionally, the manuscript has been edited extensively. We have further detailed these modifications in our point-by-point responses to the reviewer.

    1. Author Response

      Reviewer #1 (Public Review):

      Bornstein and colleagues address an important question regarding the molecular makeup of the different cellular compartments contributing to the muscle spindle. While work focusing on single components of the spindle in isolation (proprioceptors, gamma-motor neurons, and intrafusal muscle fibres) has recently been published, a comprehensive analysis of the transcriptome and proteome of the spindle was missing, and this study fills an important gap, considering how local translation and protein synthesis can affect the development and function of such a specialised organ.

      The authors combine bulk transcriptome and proteome analysis and identify new markers for neuronal, intrafusal, and capsule compartments that are validated in vivo and are shown to be useful for studying aspects of spindle differentiation during development. The methodology is sound and the conclusions are in line with the results.

      We thank the reviewer for highlighting the importance of our study.

      I feel a bit more analysis regarding the specificity and developmental expression profiles of the identified markers would be a great addition. In particular:

      • Are any of the proprioceptive sensory neurons markers specific for fibres innervating the muscle spindles or also found in Golgi tendon organs?

      We thank the reviewer for the important question, following which we performed two additional analyses. First, in order to study the specificity of the spindle afferent genes we identified, we examined the overlap between our list of 260 potential proprioceptive neuron genes and markers for the three proprioceptive neuron subtypes (Ia, II and Ib) identified by Wu and colleagues (Wu et al. 2021). As shown in the newly added Figure 1 – figure supplement 2F, while we found many genes that are common to all subtypes, 69 genes exclusively overlapped with subtype markers (22 genes with type Ia neurons, 45 genes with type II neurons and 2 genes with both; lists are shown in Supplementary File 4). These results suggest that the 69 genes are expressed by muscle spindle afferents and not by GTO afferents.
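      The exclusivity criterion described above amounts to set operations: a candidate gene counts as spindle-afferent-specific if it overlaps the Ia and/or II marker lists but not the Ib (GTO) list. The sketch below assumes that reading of the criterion; the gene names are placeholders, not genes from Supplementary File 4.

```python
def exclusive_spindle_genes(candidates, markers):
    """Group candidates by which spindle subtypes (Ia, II) they match, excluding Ib matches."""
    groups = {"Ia only": set(), "II only": set(), "Ia and II": set()}
    for gene in candidates:
        in_ia = gene in markers["Ia"]
        in_ii = gene in markers["II"]
        if gene in markers["Ib"] or not (in_ia or in_ii):
            continue  # shared with GTO afferents, or not a subtype marker at all
        if in_ia and in_ii:
            groups["Ia and II"].add(gene)
        elif in_ia:
            groups["Ia only"].add(gene)
        else:
            groups["II only"].add(gene)
    return groups

candidates = {"geneA", "geneB", "geneC", "geneD", "geneE"}  # placeholder names
markers = {
    "Ia": {"geneA", "geneC", "geneD"},
    "II": {"geneB", "geneC", "geneD"},
    "Ib": {"geneD"},
}
groups = exclusive_spindle_genes(candidates, markers)
print({k: sorted(v) for k, v in groups.items()})
print(sum(len(v) for v in groups.values()))
```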

      Second, to study the specificity of our validated markers, we examined the expression of ATP1a3, VCAN and GLUT1, marking proprioceptive neurons, extracellular matrix and outer capsule, respectively, in GTOs. Results showed that all three markers were also detected in the different tissues composing the GTOs (newly added Figure 3 – figure supplement 3, below). As ATP1a3 is not in the list of 69 unique markers, this analysis indicates that it is expressed by all proprioceptive neurons. The expression of both VCAN and GLUT1 in GTO capsules highlights the similarity between the capsules of the two proprioceptors.

      • Along the same lines, are any of the gamma motor neuron markers also found in alpha motor neurons?

      We thank the reviewer for raising this issue. Following the reviewer’s question, we conducted a detailed analysis of the expression of potential γ motor neuron genes. To this end, we first generated a list of α motor neuron genes in our data by performing ranked GSEA using published expression profiles of these neurons (Blum et al., 2021). Then, we compared the three lists of neuronal genes, i.e. γ motor neurons, α motor neurons and proprioceptive neurons (newly added Figure 1 – figure supplement 2G), and found an overlap among the three lists. Nonetheless, we also identified 40 spindle genes that are specific to γ motor neurons (Figure 1 – figure supplement 2G and Supplementary File 4) and, therefore, are potential markers for these neurons.

      • How early is ATP1A3 expression found in neurons at the spindle or in fibres starting to innervate the muscle? A couple of late embryonic timepoints would be great.

      We thank the reviewer for this suggestion. We performed late embryonic (E15.5-E17.5) staining for ATP1a3, which showed its expression as early as E15.5 (new Figure 4 – figure supplement 1).

      • Given that the approach used provides insight into whether local translation plays a major role in the differentiation of the spindle, it would be interesting to assess whether the proprioceptor and gamma motor neuron markers identified are also found in the cell body or exclusively at the spindle.

      The reviewer raises an interesting question about local translation of the neuronal genes. Going through the literature, several lines of evidence indicate that the genes expressed at the neuronal end are also expressed in the neuron soma. In a study on the retinal ganglion cell translatome, Holt and colleagues found that the axonal translatome is a subset of the significantly larger somal translatome (Shigeoka et al., Cell, 2016). Similarly, a study by Schuman and colleagues that compared the translatome of neuronal cell bodies, dendrites, and axons of rat hippocampal neurons showed that many common genes are translated, albeit at different levels (Glock et al., PNAS, 2021). Finally, following the reviewer’s suggestion, we studied the expression of ATP1a3 in the DRG, and found it to be expressed there as well (Figure L1). Thus, we predict that the markers we found in the neuronal endings are likely also expressed in the soma. While this issue is very interesting, we believe that further validation of our assumption exceeds the scope of this study.

      Figure L1. ATP1a3 expression in the DRG. Confocal images of DRG sections from adult PValb-Cre;tdTomato mice stained for ATP1a3 (magenta). Scale bars represent 50 μm.

      Altogether, this is a novel and important work that will benefit scientists studying the neuromuscular and musculoskeletal systems by pushing the field toward a holistic understanding of the muscle spindle. These datasets in combination with the previous ones can be used to develop new genetic and viral strategies to study muscle spindle development and function in healthy and pathological states by analysing the roles and relative contributions of different components of this fascinating and still mysterious organ.

      We again thank the reviewer for highlighting the importance of our study.

      Reviewer #2 (Public Review):

      The data presented are of high quality. Through complementary experiments involving the isolation of masseter muscle spindles, the authors perform RNA-seq and proteomic analysis, and identify genes and proteins that are differentially expressed in the muscle spindle versus the adjacent muscle fiber, and proteins that accumulate specifically in capsule cells and nerve endings. These data, while essentially descriptive, provide important information about the developmental framework of the sensory apparatus present in each muscle that accounts for its tension/contraction state. The data presented thus allow for a better characterization of muscle spindles and provide the community with a set of new markers for better identification of these structures. Analysis of the expression pattern of the Tomato reporter in transgenic animals under the control of Piezo2-CRE, Gli1-CRE and Thy1-YFP reporter reinforces the findings and the specificity of the expression pattern of the specific genes and proteins identified by the multi-omics approach and further validated by immunohistochemistry.

      We thank the reviewer for the positive and encouraging feedback.

    1. Author Response

      Reviewer #1 (Public Review):

      In this manuscript, Marmor and colleagues reanalyze a previously published dataset of chronic widefield Ca2+ imaging from the dorsal cortex of mice as they learn a go/no-go somatosensory discrimination task. Comparing hit trials that have a distinct history (i.e. are preceded by distinct trial types), the authors find that hit trials preceded by correct rejections of the nontarget stimulus are associated with larger subsequent neural responses than trials preceded by other hits, across the cortex. The authors analyze the time course over which this effect emerges in the barrel cortex (BC) and the rostrolateral visual area (RL), and find that its magnitude increases as the animals become expert task performers. Although the findings are potentially interesting, I, unfortunately, believe that there are important methodological concerns that could put them into question. I also disagree with the rationale that singles out BC and RL as being especially important for the emergence of trial history effects on neural responses during decision-making. I detail these points below.

      1) The authors did not perform correction for hemodynamic contamination of GCaMP fluorescence. In widefield imaging, blood vessels divisively decrease neural signals because they absorb green-wavelength photons, which could lead to crucial confounds in the interpretation of the main results because of neurovascular coupling, which lags neural activity by seconds. For example, if a reward response from the previous trial is associated with a lagged hemodynamic contamination that artificially decreases the signal in the following trial, one could get artificially higher activity in trials that were not preceded by a reward (i.e. CR), which is what the authors observed. Ideally, the experiments would be repeated with proper hemodynamic correction, but at the very least the authors should try to address this with control analyses.

      Done. We have essentially redone the experiment with proper hemodynamic correction, and the trial history results were maintained. Please see point 1 above for more details (Figures S4 and S5). In addition to hemodynamic controls, we also present new two-photon single-cell data with similar results in Figure S6. We also added a dedicated section on this to the Methods section (pg. 12).

      For example, what is the time course of reward-related responses in BC and elsewhere?

      In general, and specifically in BC, reward-related responses return to baseline within 5 seconds after the start of the reward period and at least 5 seconds before the stimulus presentation of the next trial. In the new experiments, we further extended the baseline period by an additional 2 seconds as a precaution. Trial history information was still present with the extended inter-trial interval.

      The text now reads (pg. 4): "We further report that responses during the reward period in cortex and specifically in BC went back to baseline 4-5 seconds after the start of the reward period and 6-8 seconds before the presentation of the next stimulus (total inter-trial interval ranged between 10-12 seconds)."

      Do hemodynamics artifacts have a trial-by-trial correlation with the subsequent trial history effect?

      We have now done the proper hemodynamic control (Figure 2) and we did not find a strong effect of hemodynamic responses on trial history information.

      What is the learning time course of reward responses?

      Responses during the reward period as a function of learning were not significantly modulated. We further show the whole learning profile for BC response during the reward period in Author response image 1.

      Author response image 1.

      Response in BC averaged during the reward period (2-4 sec after texture stop) as a function of learning for each mouse separately.

      The text now reads (pg. 4): "In addition, responses in BC during the reward period were not consistently modulated as a function of learning (p>0.05; Wilcoxon signed-rank test between naïve and expert, BC response averaged during the reward period, 2-4 seconds after stimulus onset; n=7 mice). Taken together, we find that direct responses from the reward period do not affect history-related responses during the next trial."

      Note that I don't believe the FA-Hit condition analysis that the authors have already presented provides adequate control, as punishment responses are also pervasive in the cortex and therefore suffer from the same interpretational caveat. Unfortunately, I believe this is a serious methodological issue given the above. However, I will proceed to take the reported results at face value.

      We hope that our additional hemodynamic control analyses are satisfactory.

      2) The statistics used to assess the effect of trial history over learning are inadequate (e.g., Fig 2b). The existence of a significant effect in one condition (e.g., CR-Hit vs. Hit-Hit in expert) but not in another (e.g., same comparison in naive) does not imply that these two conditions are different. This needs to be tested directly. Moreover, the present analysis does not account for the fact that measures across learning stages are taken from the same animals. Thus, the appropriate analysis for these cases would be to first use a two-way ANOVA with repeated measures with factors of trial history and learning stage (or equivalent non-parametric test) and then derive conclusions based on post hoc pairwise tests, corrected for multiple comparisons.

      Done. We performed a two-way repeated-measures ANOVA as suggested and found significant effects of trial history and learning, along with a significant interaction, for BC.

      The text now reads (pg. 4): "This difference was significant during the stim period in learning and expert phases across mice (Fig. 2b; two-way repeated-measures ANOVA; F(1,6) = 51, p < 0.001 for trial history; F(2,12) = 18, p < 0.001 for learning; F(2,12) = 5, p < 0.05 for the interaction between trial history and learning; post hoc Tukey analysis: p < 0.05 for trial history in the learning and expert phases, p > 0.05 in the naïve phase)."
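      To show where the reported degrees of freedom come from, the sketch below computes a fully within-subject two-way repeated-measures ANOVA from first principles. It assumes a complete, balanced design with one value per mouse per cell (for example, 7 mice x 2 trial-history conditions x 3 learning phases); with those sizes the F ratios have (1, 6), (2, 12), and (2, 12) degrees of freedom, matching the text. It returns F ratios only: p-values would additionally need the F distribution's survival function (for example `scipy.stats.f.sf`). The data are simulated.

```python
import random
from itertools import product

def rm_anova_2way(y):
    """Two-way repeated-measures ANOVA; y[s][i][j] = subject s, level i of A, level j of B.

    Returns {effect: (F, df_effect, df_error)} for A, B, and the A x B interaction.
    """
    n, a, b = len(y), len(y[0]), len(y[0][0])
    cells = list(product(range(n), range(a), range(b)))
    g = sum(y[s][i][j] for s, i, j in cells) / (n * a * b)
    m_s = [sum(y[s][i][j] for i in range(a) for j in range(b)) / (a * b) for s in range(n)]
    m_a = [sum(y[s][i][j] for s in range(n) for j in range(b)) / (n * b) for i in range(a)]
    m_b = [sum(y[s][i][j] for s in range(n) for i in range(a)) / (n * a) for j in range(b)]
    m_ab = [[sum(y[s][i][j] for s in range(n)) / n for j in range(b)] for i in range(a)]
    m_sa = [[sum(y[s][i][j] for j in range(b)) / b for i in range(a)] for s in range(n)]
    m_sb = [[sum(y[s][i][j] for i in range(a)) / a for j in range(b)] for s in range(n)]

    # Effect sums of squares.
    ss_a = n * b * sum((m - g) ** 2 for m in m_a)
    ss_b = n * a * sum((m - g) ** 2 for m in m_b)
    ss_ab = n * sum((m_ab[i][j] - m_a[i] - m_b[j] + g) ** 2
                    for i in range(a) for j in range(b))
    # Error terms: each effect is tested against its interaction with subjects.
    ss_as = b * sum((m_sa[s][i] - m_s[s] - m_a[i] + g) ** 2
                    for s in range(n) for i in range(a))
    ss_bs = a * sum((m_sb[s][j] - m_s[s] - m_b[j] + g) ** 2
                    for s in range(n) for j in range(b))
    ss_abs = sum((y[s][i][j] - m_sa[s][i] - m_sb[s][j] - m_ab[i][j]
                  + m_s[s] + m_a[i] + m_b[j] - g) ** 2 for s, i, j in cells)

    def f_ratio(ss_eff, df_eff, ss_err, df_err):
        return (ss_eff / df_eff) / (ss_err / df_err), df_eff, df_err

    return {
        "A": f_ratio(ss_a, a - 1, ss_as, (a - 1) * (n - 1)),
        "B": f_ratio(ss_b, b - 1, ss_bs, (b - 1) * (n - 1)),
        "AxB": f_ratio(ss_ab, (a - 1) * (b - 1), ss_abs, (a - 1) * (b - 1) * (n - 1)),
    }

# Simulated data: 7 mice x 2 history conditions (A) x 3 learning phases (B).
rng = random.Random(0)
data = [[[1.0 * s + 0.5 * i + 0.3 * j + rng.gauss(0, 0.1) for j in range(3)]
         for i in range(2)] for s in range(7)]
for effect, (f, df1, df2) in rm_anova_2way(data).items():
    print(f"{effect}: F({df1},{df2}) = {f:.1f}")
```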

      3) I am not convinced that BC and RL are especially important for trial-history-dependent effects. Figures 4 and 5 suggest that this modulation is present across the cortex, and in fact, the difference between CR-Hit and Hit-Hit in some learning stages appears stronger in other areas. BC and RL do have the highest absolute activity during the epochs in Figs 4 and 5, but I would argue that this is likely due to other aspects of the task (e.g., touch) and therefore is not necessarily relevant to the issue of trial history.

      Done. First, we would like to point out that RL during the pre period displays the largest difference between the CR-Hit and Hit-Hit conditions (Fig. 5c bottom). Second, we now show difference maps (i.e., activity in CR-Hit minus Hit-Hit) which clearly show a positive activity patch in BC during the stim period for 5 out of the 7 mice (Fig. S10a). Example maps also highlight RL during the pre period (Fig. S10b). We note that activity patches somewhat spread over to other areas and also slightly vary across mice. This is why the grand average may partly average out trial history information. Taken together, we strongly feel that trial history information emerges in RL (and adjacent posterior association areas) during the pre period and shifts towards BC during the stim period.

      Nevertheless, we agree with the reviewer that other areas (that do not necessarily display high activity) may encode trial history information and we now clearly report this in the text (pg. 5): "We note that other areas, e.g., different association areas, also encoded history-dependent information especially during learning and expert phases. In addition, we present activity difference maps between CR-Hit and Hit-Hit conditions during the stim period (Fig. S10a). These maps clearly show the highest trial history information (i.e., difference in activity) in BC. Taken together, these results indicate that BC encodes history-dependent information that emerges during the stim period and just after learning."

      And also in (pg. 6): "In addition, we present activity difference maps between CR-Hit and Hit-Hit conditions during the pre period (Fig. S10b). These maps localize trial history information to RL which also spreads to other adjacent association areas. Moreover, activity patches slightly vary across the different mice which may affect the grand average (averaged across mice) of each area."

      4) Because of similar arguments to the above, and because this was not directly assessed, I do not believe the conclusion that history information emerges in RL and is transferred to BC is warranted. For instance, there is no direct comparison between areas, but inspection of the ROC plots in Fig 6b suggests that history information emerges concomitantly across cortical areas. I suggest directly comparing the time course between these and other areas.

      Done. We now add example history AUC maps and quantify history AUC for all 25 areas during the pre and stim periods. During the pre period (Fig. 6), AUC values are concentrated around RL (and other PPC areas), whereas during the stim period AUC values shift to BC. Again, due to inter-mouse variability, these differences are partly averaged out, which also limits the power of statistical tests (with only 7 mice).

      The text now reads (pg. 7): "We next calculated the history AUC for each pixel during either the pre or stim period. The history AUC maps during the pre period display AUC values around the RL areas (Fig. 6f). In contrast, the history AUC maps during the stim period display AUC values mostly in BC (Fig. 6g). Quantified across 25 areas and averaged across mice, RL displays the highest history AUC during the pre period, whereas BC displays the highest history AUC values during the stim period (Fig. 6h). We note that other cortical areas such as other association areas also display high history AUC values. Taken together, we find that trial history emerges in RL before the texture arrives and then shifts to BC during stimulus presentation. "
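      A history AUC of this kind can be read as the probability that a randomly chosen CR-Hit trial evokes a larger response than a randomly chosen Hit-Hit trial, with 0.5 meaning no history information. The sketch below computes it per pixel or area by exhaustive pairwise comparison. It assumes trial responses averaged over the relevant window have already been extracted; the sign convention (CR-Hit as the "positive" class), the area labels, and the toy values are assumptions, not details taken from the Methods.

```python
def history_auc(cr_hit, hit_hit):
    """ROC AUC = P(CR-Hit response > Hit-Hit response), counting ties as 1/2."""
    wins = 0.0
    for x in cr_hit:
        for y in hit_hit:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(cr_hit) * len(hit_hit))

def auc_map(responses):
    """responses: {pixel_or_area: (cr_hit_values, hit_hit_values)} -> {pixel_or_area: AUC}."""
    return {key: history_auc(cr, hh) for key, (cr, hh) in responses.items()}

toy = {
    "BC": ([1.2, 1.5, 1.4, 1.6], [1.0, 1.1, 1.3, 0.9]),  # illustrative dF/F values
    "V1": ([0.5, 0.6, 0.4, 0.5], [0.5, 0.4, 0.6, 0.5]),
}
print(auc_map(toy))
```

The pairwise form is quadratic in trial count but exact; for large trial numbers, a rank-based formula gives the same value faster.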

      5) How much is task performance itself modulated by trial history? How does this change over the course of learning? These behavioral analyses would greatly help interpret the neural findings and how this trial history might be used behaviorally.

      Done. We have now calculated d-prime separately for Hit-Hit and CR-Hit trials. We find no significant differences between conditions either within or across mice (see Fig. S2 below).

      The text now reads (pg. 3): "We note that learning curves that are calculated separately for each pair (i.e., either a preceding Hit or CR trial) were not significantly different (Fig. S2)."
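      Splitting d' by trial history means conditioning on the outcome of the preceding trial and then computing hit and false-alarm rates from the current trials only. A minimal sketch, assuming trials are stored as (is_go, licked) pairs and applying a log-linear correction (adding 0.5 to each count) so that rates of 0 or 1 stay finite. Both the data layout and the correction are assumptions, not necessarily what was used for Fig. S2.

```python
from statistics import NormalDist

def outcome(is_go, licked):
    """Classify a go/no-go trial."""
    if is_go:
        return "Hit" if licked else "Miss"
    return "FA" if licked else "CR"

def d_prime(hits, misses, fas, crs):
    """d' = z(hit rate) - z(false-alarm rate), with a log-linear correction."""
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (fas + 0.5) / (fas + crs + 1)
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

def d_prime_after(trials, previous_outcome):
    """d' computed only over trials whose preceding trial had the given outcome."""
    counts = {"Hit": 0, "Miss": 0, "FA": 0, "CR": 0}
    for prev, cur in zip(trials, trials[1:]):
        if outcome(*prev) == previous_outcome:
            counts[outcome(*cur)] += 1
    return d_prime(counts["Hit"], counts["Miss"], counts["FA"], counts["CR"])

# Toy session: (is_go, licked) for each trial in order.
session = [(True, True), (False, False), (True, True), (True, True),
           (False, True), (False, False), (True, True), (False, False)]
print(d_prime_after(session, "Hit"), d_prime_after(session, "CR"))
```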

      Reviewer #2 (Public Review):

      Marmor et al. mine a previously published dataset to examine whether recent reward/stimulus history influences responses in sensory (and other) cortices. Bulk L2/3 calcium activity is imaged across all of the dorsal cortex in transgenic mice trained to discriminate between two textures in a go/no-go behavior. The authors primarily focus on comparing responses to a specific stimulus given that the preceding trial was or was not rewarded. There are clear differences in activity during stimulus presentation in the barrel cortex along with other areas, as well as differences even before the second stimulus is presented. These differences only emerge after task learning. The data are of high quality and the paper is clear and easy to follow. My only major criticism is that I am not completely convinced that the observed difference in response is not due to differences in movement by the animal on the two trial types. That said, the demonstration of differences in sensory cortices is relatively novel, as most of the existing literature on trial history effects demonstrates such differences only in higher-order areas.

      Major:

      1a) The claim that body movements do not account for the results is in my view the greatest weakness of the paper - if the difference in response simply reflects a difference in movement, perhaps due to "excitement" in anticipation of reward after not receiving one on CR-H vs. HH trials, then this should show up in movement analysis. The authors do a little bit of this, but to me, more is needed.

      Done. We have now extensively and carefully analyzed body and whisker movements for CR-Hit and Hit-Hit conditions. First, in the figure below, we decomposed body movements into 22 different body parts using DeepLabCut. In short, we find no significant difference between CR-Hit and Hit-Hit conditions for any individual body part (Fig. S7 below). This was true for the naïve, learning and expert phases. Please see additional analyses in the points below.

      This is now reported in the text (pg. 4): “In addition, we performed a more detailed body and whisker analysis, e.g., decomposing the movement to different body parts and obtaining single whisker dynamics. These analyses did not find significant differences in movement parameters between CR-Hit and Hit-Hit conditions (Fig. S7 and S8).”

      First, given the small sample size and use of non-parametric tests, you will only get p<.05 if at least 6 of the 7 mice perform in the same way. So getting p>.05 is not surprising even if there is an underlying effect. This makes it especially important to do analyses that are likely to reveal any differences; using whisker angle and overall body movement, which is poorly explained, is in my opinion insufficient. An alternative approach would be to compare movements within animals; small as the dataset is, it is feasible to do an animal-by-animal analysis, and then one could leverage the large trial count to get much greater statistical power, foregoing summary analyses that pool over only n=7.
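      The point about n = 7 can be checked by enumerating the exact null distribution of the Wilcoxon signed-rank statistic (assuming no ties or zero differences). Each of the 2^7 = 128 sign patterns is equally likely under the null, so the two-sided p-value for an observed T = min(W+, W-) is twice the fraction of rank subsets summing to at most T:

```python
from itertools import combinations

def exact_signed_rank_p(n, t_obs):
    """Two-sided exact p-value of the Wilcoxon signed-rank test for n pairs (no ties)."""
    ranks = range(1, n + 1)
    extreme = sum(
        1
        for k in range(n + 1)
        for subset in combinations(ranks, k)
        if sum(subset) <= t_obs
    )
    return min(1.0, 2 * extreme / 2 ** n)

# T is the rank sum of the mice that go in the minority direction.
for label, t in [("7/7 same direction", 0),
                 ("6/7, opposite mouse has rank 2", 2),
                 ("6/7, opposite mouse has rank 3", 3)]:
    print(label, exact_signed_rank_p(7, t))
```

This gives p = 0.015625, 0.046875, and 0.078125, respectively: p < 0.05 requires all seven mice, or six with the single discordant mouse among the two smallest absolute differences, to move in the same direction. Two discordant mice already give T ≥ 3.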

      We agree with this point and have now substantially improved our statistical analysis.

      1) We now perform within mouse statistics for responses in BC during naïve, learning and expert (see Fig. S4 below). In short, we find statistical significance for 7 out of 7 mice during the expert phase, 6 out of 7 mice in the learning phase and 0 out of 7 in the naive phase. For RL during the pre period we find significant difference in 5 out of 7 expert mice.

      This is now reported in the text (pg. 4): "In addition, a statistical comparison between CR-Hit and Hit-Hit responses within each mouse separately maintained significance for expert (7/7 mice; Mann-Whitney U-test p<0.05) and learning (6/7 mice) but not for naïve (0/7 mice; Fig. S3)."

      And also in (pg. 5): "In addition, a statistical comparison between CR-Hit and Hit-Hit responses in RL within each mouse separately maintained significance for expert (5/7 mice; Mann-Whitney U-test p<0.05)."

      2) We would like to point out that we have now added 3 additional mice (with hemodynamics control) and performed within mouse statistics in BC and RL (Fig. S5), adding to our initial observations.

      3) In terms of body movements, we now performed within-mouse statistics and compared body movements between CR-Hit and Hit-Hit conditions. In general, most mice did not show a significant difference in body movements or whisker envelope.

      This is now reported in the text (pg. 4): "A within mouse statistical comparison between body or whisker parameters in CR-Hit and Hit-Hit maintained a non-significant difference in expert (1/7 mice displayed a significant difference; Mann-Whitney U-test p>0.05), learning (2/7 mice) and naïve (0/7 mice)."

      And also in (pg. 4): "Body movements and whisker parameters did not significantly differ between CR-Hit and Hit-Hit conditions during the pre-period (similar to the stim period; across and within mice; p>0.05, Mann-Whitney U-test)."

      In summary, we have now substantially improved our statistical analysis and further decomposed the body movements, maintaining the trial history results.

      The authors only consider a simple parametrization of movement (correlation across successive frames), and given the high variability in movement across animals, it is likely that different mice adopt different movements during the task, perhaps altering movement in specific ways. Aggregating movement across different body parts after an analysis where body parts are treated separately seems like an odd choice; perhaps it is fine, but again, supporting evidence for this is needed. As it stands, it is not clear whether real differences were averaged out by combining all body parts, or what the averaging actually entails.

      Please see the point above, where we decompose body movements (Fig. S7 and the Methods section, pg. 14).

      If at all possible, I would recommend examining curvature and not just the whisker angle, since the angle being the same is not too surprising given that the stimulus is in the same place. If the animal is pressing more vigorously on CR-H trials, this should result in larger curvature changes.

      Done. We now decompose whisker dynamics (i.e., curvature) using DeepLabCut (Fig. S8). In general, we find no significant differences in whisker parameters between the Hit-Hit and CR-Hit conditions.

      This is now reported in the text (pg. 4): "In addition, we performed a more detailed body and whisker analysis, e.g., decomposing the movement to different body parts. This analysis did not find significant differences between CR-Hit and Hit-Hit conditions (Fig. S7 and S8)."

      Finally, the authors presumably have access to lick data. Are reaction times shorter on CR-H trials? Is lick count or lick frequency lower?

      Done. We have now calculated lick reaction time and lick rate, and found a significant difference in lick reaction time but not in lick rate. We show a figure for the reviewer and report this in the text.

      The text now reads (pg. 3): "In addition, lick reaction time (but not lick rate) differed significantly between Hit-Hit and CR-Hit trials (p<0.05; Wilcoxon signed-rank test), possibly indicating a more considered response after a previous stop signal."

      If movement differs across trial types, it is entirely plausible that at least barrel cortex activity differences reflect differences in sensory input due to differences in whisker position/posture/etc. This would diminish the novelty of the present results.

      As detailed above, we have now meticulously analyzed the differences in whisker parameters between the two conditions and did not find any significant differences.

      1b) Given the importance of this control to the story, both whisker and body movement tracking frames should be explicitly shown either in the primary paper or as a supplement. Moreover, in the methods, please elaborate on how both whisker and body tracking were performed.

      Done. Please see Figs. S7 and S8 for tracking frames. This is now detailed in the points above and in the revised Methods section.

      2) Did streak length impact the response? For instance, in Fig. 1f "Learning", there is a 6-trial "no-go" streak; if the data are there, it would be useful to plot CR-H responses as a function of preceding unrewarded trials.

      Done. We have now calculated CR-Hit responses as a function of the number of preceding CRs. In general, we obtain inconsistent results across mice, which may be due to the small number of trials with more than one preceding CR. Nevertheless, some mice show a trend, sometimes significant, in which CR-Hit responses are higher for longer preceding CR streaks. This is especially true during the learning phase. We have decided not to include this in the manuscript and present this figure only to the reviewer.

    1. Author response:

      Reviewer #1 (Public Review):

      This is an important and very well conducted study providing novel evidence on the role of zinc homeostasis for the control of infection with the intracellular bacterium S. typhimurium also disentangling the underlying mechanisms and providing clear evidence on the importance of spatio-temporal distribution of (free) zinc within the cell.

      We thank the reviewer for the positive comments.

      1) It would be important to provide more information on the genotype of mice.

      As suggested by the reviewer, we have added the detailed genotype of Slc30a1flagEGFP/+ and Slc30a1fl/flLysMCre mice to the revised supplementary Figure supplement 10.

      2) It is rather unlikely that C57Bl6 mice survive up to two weeks after i.p. injection of 1x10E5 bacteria.

      As suggested by the reviewer, we tested survival rates in a group of our experimental animals and in C57BL/6 wild-type mice.

      The Salmonella strain was a gift from Professor Ge Bao-xue. We sent this strain for genetic characterisation and found 100% identity with many Salmonella enterica Typhimurium strains originating from poultry. One of them is Salmonella enterica subsp. enterica serovar Typhimurium strain MeganVac1 (Accession: CP112994.1), a live attenuated strain. We hope this explains why the mice survive such a high infectious dose.

      Author response image 1.

      (A) Survival rate of Slc30a1fl/fl and Slc30a1fl/flLysMCre mice (n = 14-15/group) and (B) survival rate of C57BL/6 wild-type mice (n = 8) over two weeks after Salmonella infection. (C) Full-length sequence (1,478 bases) of the 16S rDNA gene of the Salmonella strain and (D) the sequencing electropherogram.

      3) To be sure that macrophages from Slc30A1 fl/fl LysMcre mice really have an impaired clearance of bacteria, it would be important to rule out an effect of Slc30A1 deletion on bacterial phagocytosis and containment (e.g., evaluation of bacterial numbers after 30 min of infection).

      As the reviewer advised, we repeated the experiment and measured bacterial numbers after 30 min of infection (dashed line in A). The results show no statistically significant difference in bacterial numbers at 30 min between Slc30a1fl/flLysMCre and Slc30a1fl/fl BMDMs. Therefore, the difference in bacterial numbers at 24 hours reflects impaired intracellular pathogen-killing capacity, as the reviewer pointed out.

      Author response image 2.

      (A) Time course of the intracellular pathogen-killing capacity of Salmonella-infected Slc30a1fl/flLysMCre and Slc30a1fl/fl BMDMs measured in colony-forming units per ml (n = 5). (B) Fold change in Salmonella survival (CFU/mL) at different time points from A. (C) Representative images of Salmonella colonies on solid agar medium at 24 hours. Data are represented as mean ± SEM. P values were determined using a 2-tailed unpaired Student’s t-test. *P<0.05, **P<0.01, and ns, not significant.

      4) Does the addition of zinc to macrophages negatively affect iNOS transcription, as previously observed for the divalent metal iron, and is a similar mechanism (C/EBPβ/NF-IL6 modulation) also employed (Dlaska M et al. J Immunol 1999)?

      The reviewer has raised an important point here, since free zinc also plays a role at multiple levels of cellular signaling (Kambe et al., 2015). Dlaska and Weiss reported that NF-IL6, a protein responsible for iNOS transcription, is negatively regulated by iron perturbation under IFNγ/LPS stimulation in macrophages (Dlaska and Weiss, 1999). As the reviewer suggested, our results showed that zinc supplementation decreases iNOS expression in macrophages after Salmonella infection, suggesting that free zinc might play a role in iNOS regulation.

      However, in Slc30a1fl/flLysMCre macrophages, despite increased intracellular free zinc, the lack of Slc30a1 also induces Mt1, a zinc reservoir that might negatively affect NO production (Schwarz et al., 1995) or, alternatively, inhibit iNOS through the NF-κB pathway (Cong et al., 2016), as reported by previous studies. Therefore, we cannot rule out the possibility that defects in Salmonella clearance due to iNOS/NO inhibition are caused by a complex combination of excess free zinc and overexpression of the zinc reservoir. To test this hypothesis, further studies using specific targets, for example an Mtfl/fliNOSfl/flLysMCre model, might be needed to investigate the precise mechanism.

      Author response image 3.

      RT-qPCR analysis of mRNA encoding Nos2 in BMDMs after infection with Salmonella or Salmonella plus ZnSO4 (20 μM) for 4 h.

      Reference:

      Dlaska M, Weiss G. 1999. Central role of transcription factor NF-IL6 for cytokine and iron-mediated regulation of murine inducible nitric oxide synthase expression. The Journal of Immunology. 162:6171-6177, PMID: 10229861

      Kambe T, Tsuji T, Hashimoto A, Itsumura N. 2015. The physiological, biochemical, and molecular roles of zinc transporters in zinc homeostasis and metabolism. Physiological Reviews. 95:749-784. https://doi.org/10.1152/physrev.00035.2014, PMID: 26084690

      Schwarz MA, Lazo JS, Yalowich JC, Allen WP, Whitmore M, Bergonia HA, Tzeng E, Billiar TR, Robbins PD, Lancaster JR Jr, et al. 1995. Metallothionein protects against the cytotoxic and DNA-damaging effects of nitric oxide. Proceedings of the National Academy of Sciences of the United States of America. 92:4452-4456. https://doi.org/10.1073/pnas.92.10.4452, PMID: 7538671

      Cong W, Niu C, Lv L, Ni M, Ruan D, Chi L, Wang Y, Yu Q, Zhan K, Xuan Y, Wang Y, Tan Y, Wei T, Cai L, Jin L. 2016. Metallothionein prevents age-associated cardiomyopathy via inhibiting NF-κB pathway activation and associated nitrative damage to 2-OGD. Antioxidants & Redox Signaling. 25:936-952. https://doi.org/10.1089/ars.2016.6648, PMID: 27477335

      5) How does Zinc or TPEN supplementation to bacteria in LB medium affect the log growth of Salmonella?

      We found that zinc supplementation at both low (20 µM) and high (640 µM) concentrations negatively affects Salmonella growth in broth culture, especially during the log and stationary phases, whereas TPEN (20 µM) supplementation does not. This indicates that high-zinc conditions such as those occurring within phagosomes (Botella et al., 2011) can limit bacterial growth.

      Author response image 4.

      Growth curve (optical density, OD 600 nm) of Salmonella in LB medium at different concentrations of ZnSO4 and/or TPEN. Bar graph indicating Salmonella growth at specific time points. Each value is expressed as the mean of triplicates for each test, and data were analyzed using a 2-tailed unpaired Student’s t-test. *P<0.05, **P<0.01, ***P<0.001 and ns, not significant.

      Reference:

      Botella H, Peyron P, Levillain F, Poincloux R, Poquet Y, Brandli I, Wang C, Tailleux L, Tilleul S, Charrière GM, Waddell SJ, Foti M, Lugo-Villarino G, Gao Q, Maridonneau-Parini I, Butcher PD, Castagnoli PR, Gicquel B, de Chastellier C, Neyrolles O. 2011. Mycobacterial p(1)-type ATPases mediate resistance to zinc poisoning in human macrophages. Cell Host Microbe. 10:248-59. https://doi.org/10.1016/j.chom.2011.08.006, PMID: 21925112

      Reviewer #2 (Public Review):

      This paper explores the importance of zinc metabolism in host defense against the intracellular pathogen Salmonella Typhimurium. Using conditional mice with a deletion of the Slc30a1 zinc exporter, the authors show a critical role for zinc homeostasis in the pathogenesis of Salmonella. Specifically, mice deficient in the Slc30a1 gene in LysM+ myeloid cells are hypersusceptible to Salmonella infection, and their macrophages show altered phenotypes in response to Salmonella. The study adds important new information on the role metal homeostasis plays in microbe-host interactions. Despite the strengths, the manuscript has some weaknesses. The authors conclude that lack of slc30a1 in macrophages impairs nos2-dependent anti-Salmonella activity. However, this idea is not tested experimentally. In addition, the research presented on Mt1 is preliminary. The text related to Figure 7 could be deleted without affecting the overall impact of the findings.

      We thank the reviewer for his/her positive comments and constructive suggestions.

      Reviewer #3 (Public Review):

      Na-Phatthalung et al. observed that transcripts of the zinc transporter Slc30a1 were upregulated in Salmonella-infected murine macrophages and in human primary macrophages; therefore, they sought to determine if, and how, Slc30a1 could contribute to the control of bacterial pathogens. Using a reporter mouse, the authors show that Slc30a1 expression increases in a subset of peritoneal and splenic macrophages of Salmonella-infected animals. Specific deletion of Slc30a1 in LysM+ cells resulted in a significantly higher susceptibility of mice to Salmonella infection which, counter to the authors' conclusions, is not explained by the small differences in the bacterial burden observed in vivo and in vitro. Although loss of Slc30a1 resulted in reduced iNOS levels in activated macrophages, the study lacks experiments that mechanistically link loss of NO-mediated bactericidal activity to Salmonella survival in Slc30a1-deficient cells. The additional deletion of Mt1, another zinc-binding protein, resulted in even lower nitrite levels in activated macrophages but only modest effects on Salmonella survival. By combining genetic approaches with molecular techniques that measure variables in macrophage activation and the labile zinc pool, Na-Phatthalung et al. successfully demonstrate that Slc30a1 and metallothionein 1 regulate zinc homeostasis in order to modulate effective immune responses to Salmonella infection. The authors have done a lot of work, and the finding that Slc30a1 expression in macrophages contributes to control of Salmonella infection in mice is new and will be of interest to the field. Whether the mechanism by which SLC30A1 controls bacterial replication and/or lethality of infection involves nitric oxide production by macrophages remains to be shown.

      We very much appreciate the reviewer’s detailed evaluation and suggestions. The manuscript has been revised thoroughly according to the reviewer’s advice.

    1. Author Response

      Reviewer #1 (Public Review):

      This work focuses on the mechanisms that underlie a previous observation by the authors that the type VI secretion system (T6SS) of a Pseudomonas chlororaphis (Pchl) strain can induce sporulation in Bacillus subtilis (Bsub). The authors bioinformatically characterize the T6SS system in Pchl and identify all the core components of the T6SS, as well as 8 putative effectors and their domain structures. They then show that the Pchl T6SS, and in particular its effector Tse1, is necessary to induce sporulation in Bsub. They demonstrate that Tse1 has peptidoglycan hydrolase activity and causes cell wall and cell membrane defects in Bsub. Finally, the authors also study the signaling pathway in Bsub that leads to the induction of sporulation, and their data suggest that cell wall damage may lead to the degradation of the anti-sigma factor RsiW, leading to activation of the extracellular sigma factor σW that causes increased levels of ppGpp. Sensing of high ppGpp levels by the kinases KinA and KinB may lead to phosphorylation of Spo0F, and induction of the sporulation cascade.

      The findings add to the field's understanding of how competitive bacterial interactions work mechanistically and provide a detailed example of how bacteria may antagonize their neighbors, how this antagonism may be sensed, and the resulting defensive measures initiated.

      While several of the conclusions of this paper are supported by the data, additional controls would bolster some aspects of the data, and some of the final interpretations are not substantiated by the current data.

      • The Bsub signaling pathway that is proposed is intricate and extensive, as shown in Fig 5A. However, the data supporting it are very sparse:

      a) The authors show no data showing that the proteases PrsW and/or RasP, or the extracellular sigma factor σW are necessary, or that the cleavage of RsiW is needed, for induction of sporulation - this could presumably be tested using mutants of those genes.

      It has been previously demonstrated that the proteases PrsW and/or RasP cleave RsiW under certain conditions, such as alkaline shock (Heinrich et al., 2009). First, PrsW cleaves RsiW, and the resulting truncated RsiW then serves as a substrate for RasP. In the previous version of the manuscript, we already demonstrated that treatment with Tse1 causes damage to PG and delocalization of RsiW; however, as the reviewer notes, we did not show the participation of either protease in the proposed signaling pathway. We have now generated single mutants in rsiW and prsW and treated them with Tse1. We observed no change in sporulation levels compared with untreated strains (Figure 1), a finding consistent with their proposed involvement in the sporulation signaling pathway activated by Tse1. Positive controls, that is, the single mutants grown at 37°C, were still able to sporulate. These data have been added to Figure 6B in the new version of the manuscript.

      As suggested by other reviewers, we have generated a sister plot of this figure showing the raw CFUs in each case. These data are included in Supplementary file 3. This experiment and the related figure have been incorporated into the new version of the manuscript.

      Figure 1. A) Quantification of the percentage of sporulated Bsub, rsiW and prsW cells after treatment with purified Tse1, showing that rsiW and prsW single mutants are blind to the presence of Tse1. B) Cell density (CFUs/mL) of the total (blue bars) and sporulated (brown bars) populations of different Bacillus strains (Bsub, ∆rsiW and ∆prsW), untreated and treated with Tse1. Sporulation at 37°C is shown as a positive control for each strain. Statistical significance was assessed via t-tests. *p value < 0.1, **p value < 0.001, ***p value < 0.0001.

      b) Similarly, they don't demonstrate that the levels of ppGpp increase in the cell upon exposure to Pchl.

      We have not been able to measure ppGpp levels. However, given that the levels of different nucleotides are altered in the same proposed sporulation cascade (Kriel et al., 2013; Tojo et al., 2013; López and Kolter, 2010), we instead analyzed ATP levels using an ATP Determination Kit (Thermo, A22066). ATP levels increased 3-fold in Bsub cells treated with Tse1 compared with untreated control cells. Consistently, no increase in ATP levels was observed in rsiW or prsW mutants treated with Tse1. We have incorporated all the raw luminescence data obtained for each sample and treatment in Figure 6-source data 1. This experiment, the corresponding figure (Figure 6A in the new version of the manuscript) and its description in “Materials and Methods” have been added to the new version of the manuscript.

      c) There is some data showing that kinA and kinB mutants don't induce sporulation (Fig supplement 7A), but that is lacking the 'no attacker' control that would demonstrate an induction.

      We have included in the new version of the manuscript the ‘no attacker’ control sporulation (%). The figure shows that the presence of Pchl strains induces the sporulation of all kinase mutants. This new data has been incorporated in Figure 6-figure supplement 1A in the new version of the manuscript.

      d) There are some data showing that RsiW may be cleaved (Fig 5C, D), but these data would benefit from a positive control showing that the lack of YFP foci is seen in a condition where RsiW is known to be cleaved, as well as from a time-course showing that the foci are present prior to the addition of Tse1 and then disappear. As it is shown now, it is possible that the addition of Tse1 just blocks the production of RsiW or its insertion into the membrane (especially given the membrane damage seen). Further, there are no data showing that the disappearance of the YFP foci requires the proteases PrsW and/or RasP; such data would also support the idea that the disappearance is due to cleavage of RsiW.

      Thank you for your useful suggestion. It is important to note that our transcriptomic analysis did not show repression of the genes encoding either of the two proteases in cells treated with Tse1. However, we agree that additional experiments would strengthen our findings. We have repeated the whole experiment, including a positive control to demonstrate that YFP foci disappear under a condition in which RsiW is known to be degraded by PrsW and RasP. Bacillus cells were incubated in medium at pH 10, which causes an alkaline shock that triggers RsiW cleavage (Asai, 2017; Heinrich et al., 2009). As shown in Figure 6D, YFP foci also disappeared under this condition. We have also provided additional images with quantification of the average YFP-foci signal in Figure 6-figure supplement 2.

      • The entire manuscript suggests that T6SS is solely responsible for the induction of sporulation. While T6SS does appear to play a major part in explaining the sporulation induction seen, in the absence of 'no attacker' controls for Fig. 2A, it is impossible to see this. From the data shown in Fig. 2C, and figure supplement 2A, the 'no attacker' sporulation rate seems to be ~20%, while the rate is ~40% with Pchl strains lacking T6SS, suggesting that an additional factor may be playing a role.

      This must be a misunderstanding of the message of this manuscript. The conceptual basis of this study was established in our previous manuscript (Molina-Santiago et al., 2019), where we demonstrated that B. subtilis sporulated in the presence of P. chlororaphis. Interestingly, the overgrowth of P. chlororaphis over the B. subtilis colony did not eliminate B. subtilis cells, given that most of them had sporulated. The data we obtained strongly suggested that a functional T6SS was involved in the cellular response of Bacillus during close cell-to-cell contact. In this new manuscript, we have explored this idea and found that the T6SS of P. chlororaphis indeed mobilizes at least one effector, Tse1, which is able to trigger sporulation in Bacillus. Thus, we did not conclude, either previously or in this new study, that the T6SS is the only factor expressed by P. chlororaphis responsible for activating sporulation in Bacillus. We have accordingly rephrased some sentences of the manuscript to clarify the proposed involvement of the T6SS in B. subtilis sporulation.

      In addition, as mentioned above, we have included data of sporulation percentages in the absence of an attacker to better compare the induction of sporulation observed in the presence of the different Pchl strains and in the presence of Tse1.

      Reviewer #2 (Public Review):

      In a previous study, the authors showed that cell-cell contact with Pseudomonas chlororaphis induces sporulation in Bacillus subtilis. Here, the authors build on this finding and elucidate the mechanism behind this observation. They describe the enzymatic activity of a protein (Tse1) secreted by the type VI secretion system (T6SS) of P. chlororaphis (Pch), which partially degrades the peptidoglycan (PG) of targeted B. subtilis cells and triggers a signal cascade culminating in sporulation.

      Most of the key conclusions of this paper (Tse1 being secreted by the T6SS and inducing sporulation in targeted cells) are well supported by the data. One conclusion (sporulation response being an anti-T6SS "defense" strategy) is not well supported by the data and should be removed or rephrased.

      The authors elucidate the enzymatic activity of Tse1, a T6SS effector protein, in a genus (Pseudomonas) of great interest to microbiologists, and to researchers studying the T6SS specifically. They also carefully dissect the cellular response (signal cascade and sporulation) of an important model organism (B. subtilis; Bsub) specifically to exposure to Tse1. The results describing this cellular response contribute substantially to our understanding of how T6SS effector proteins interact with cells of Gram-positive species.

      My only major concerns regard the interpretation of these results as sporulation being an adaptive and/or specific response to attacks by the T6SS. I outline my reasoning below.

      • Interpretation of sporulation as a "defense" mechanism/strategy against the T6SS. In order for a phenotype X to be regarded as a "defense against Y" mechanism, it has to be shown that phenotype X (sporulation in response to Tse1) evolved - at least in part - for the purposes of increasing survival in the presence of Y (T6SS attacker). There are no experiments in this study comparing e.g. a sporulating Bsub with a non-sporulating Bsub, that would allow testing if sporulation increases survival. The experiments carefully describe the cellular response to Tse1, but no inference can be made with regards to this being adaptive for Bsub, or if it helps the cells survive against T6SS attacks, etc. A more parsimonious explanation would be that Tse1 happens to target the PG and causes envelope stress, triggering sporulation. So, it would be a general stress response that also happens to be triggered by T6SS. Now, some general (cell envelope) stress responses are known to be very effective at protecting against the T6SS. But in those instances, a beneficial effect for survival in the face of T6SS attacks has been shown in dedicated experiments. Purely observing a response to a T6SS effector, as this study does (very well), is not evidence that the response has evolved for the purpose of surviving T6SS attacks. Tucked away in the supplement (and briefly mentioned in the main text) is data on Bsub and Bacillus cereus, showing that i) cell densities of the sporulating Bsub and a sporulating B. cereus strain are not affected by an active T6SS, and ii) cell densities of an asporogenic B. cereus are slightly reduced by an active T6SS. However, the effect sizes of density reduction by the T6SS in the asporogenic B. cereus are minute (20x10^6 vs. ~50x10^6). In typical killing assays against e.g. gram-negative strains, a typical effect size for T6SS killing would be a several order of magnitude reduction in survival of the target strain when exposed to a T6SS attacker. 
Based on this dataset alone (Figure Suppl. 8), I would say that all three Bacillus strains are not experiencing any "fitness-relevant" killing by the T6SS, which is in line with the T6SS often being useless against gram-positives when it comes to killing. Hence, no claims about fitness benefits of sporulation in response to a T6SS attack, or this being a "defense mechanism/strategy" should be made in the manuscript.
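      The effect size discussed above can be put on the same log scale used for typical T6SS killing assays. A minimal sketch, taking the approximate densities quoted in the comment (~50x10^6 versus 20x10^6) at face value; the helper name is hypothetical:

```python
import math


def log10_reduction(reference_density, treated_density):
    """Orders of magnitude by which treated_density falls below reference_density."""
    if reference_density <= 0 or treated_density <= 0:
        raise ValueError("densities must be positive")
    return math.log10(reference_density / treated_density)


# Approximate asporogenic B. cereus densities quoted in the comment above.
print(round(log10_reduction(50e6, 20e6), 2))  # 0.4
```

      A 2.5-fold difference is about 0.4 log10 units, whereas a 1,000-fold reduction, the scale often seen when a T6SS kills a susceptible target, would give a value of 3.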

      Thank you for these interesting introductory and specific comments. We agree with the reviewer: sporulation is not an adaptive or specific response of Bacillus to the T6SS; indeed, as the reviewer states, sporulation is a general stress response. The way the manuscript was written may, at some points, have given the wrong impression, so we have rephrased the relevant sentences. However, we made a mistake when generating Figure supplement 8 (Figure 6-figure supplement 3 in the new version of the manuscript). We have repeated this experiment and generated a new, corrected chart that shows a three-order-of-magnitude reduction in survival of the asporogenic B. cereus strain in competition with Pchl WT compared with the Pchl mutant strains. These new findings show that the absence of sporulation ability leads to a severe reduction in survival of the Bacillus cereus DSM 2302 population in competition with Pchl with an active T6SS, compared with survival in competition with the Pchl hcp mutant. This figure also shows that the Bacillus population decreased in competition with the tse1 mutant, demonstrating that Tse1 is responsible for killing Bacillus. However, Bacillus survival differs significantly between competitions with the hcp and tse1 mutants. The increased survival of Bacillus in the interaction with the tse1 strain compared with the Bacillus-hcp competition is suggestive of the ability of this strain to deliver additional T6SS-dependent toxins. This observation is in accordance with the data presented in Fig. 2B, which indicated that the tse1 mutant has an active T6SS able to kill E. coli.

      • The data supporting baseline "no competitor" sporulation rates being no different from those triggered by T6SS mutants are not convincing. For the data shown in Fig. 2A, a key comparison would be to show baseline Bsub sporulation rates in the absence of a competitor. This measurement is shown in Fig supplement 2A, and the value shown there (roughly 22% on average) appears to be much lower than that of the average T6SS mutant shown in Fig. 2A. The main text states that sporulation rates induced by the different T6SS mutants are "statistically" similar to the no-competitor baseline (L206/207). I am not convinced by this, since i) overall sporulation rates (including WT Pch) appear to have been lower in the experiment shown in supplement 2A, so a direct comparison between the no-competitor baseline and the data shown in Fig. 2A is not possible; and ii) hcp and tse1 mutants were tested in different experiments throughout the study, and sporulation rates appear to consistently hover around 30-40%, which is higher than the roughly 22% for "no competitor" depicted in Supplement Fig2A. I am focussing on this because, for the interpretation of the results and the main narrative of the paper, knowing whether "simply interacting with a T6SS-negative P. chlororaphis" induces some sporulation would make a big difference. One sentence in the discussion adds to my confusion about this: L464/465, "... a strain lacking paar (Δpaar) had an active T6SS that triggered sporulation comparably to Δhcp, ΔtssA, and Δtse1 strains", suggesting that the authors claim that even strains lacking an active T6SS trigger increased sporulation (which I would agree with, based on the data).

      We understand the reviewer's point that a direct comparison between the two figures is not appropriate because baseline sporulation rates fluctuate between experiments. To solve this issue, we have added the baseline "no competitor" sporulation percentages to the experiments shown in Figure 2B of the new version of the manuscript.

      Regarding the sporulation provoked by a T6SS-negative P. chlororaphis, the reviewer is right. Bacillus sporulation occurs in response to many external factors (abiotic and biotic stresses), so the presence of P. chlororaphis in the competition already has an effect on the sporulation percentage of B. subtilis. Accordingly, we have removed the statement that the sporulation rates induced by the different T6SS mutants are "statistically" similar to the no-competitor baseline. However, our previous data (Molina-Santiago, Nat Comm 2019) and current findings convincingly demonstrate the relevance of the T6SS, and specifically the Tse1 toxin, in the induction of sporulation, at least in close cell-to-cell contact.

      • Claim regarding "bacteriolytic activity" when tse1 is heterologously expressed in E. coli. The data supporting this claim (Fig2-supplement 2C) only shows a lower net population growth rate after induction of tse1 (truncated vs. non-truncated) expression. This could be caused by: slower growth (but no death), equal growth (with some death), or a combination of the two. The claim of "bacteriolytic" activity in E. coli is therefore not supported by this dataset.

      We agree with the reviewer and have decided to remove this figure and the “bacteriolytic activity” experiment, given that they do not contribute conceptually to the message of the manuscript.

      I cannot comment in more detail on the validity of the biochemistry/enzymatic activity assays as these are not my area of expertise.

      Reviewer #3 (Public Review):

      The authors identify tse1, a gene located in the type 6 secretion system (T6SS) locus of the bacterium Pseudomonas chlororaphis, as necessary and sufficient for induction of Bacillus subtilis sporulation. The authors demonstrate that Tse1 is a hydrolase that targets peptidoglycan in the bacterial cell wall, triggering activation of the regulatory sigma factor sigma-w. The sporulation-inducing effects of sigma-w are dependent on the downstream presence of the sensor histidine kinases KinA and KinB. Overall, this is a well-structured paper that uses a combination of methods including bacterial genetics, HPLC, microscopy, and immunohistochemistry to elucidate the mechanism of action of Tse1 against B. subtilis peptidoglycan. There are some concerns regarding a few experimental controls that were not included/discussed and (in a few figures) the visual representation of the data could be improved. The structure of the manuscript and experiments is such that key questions are addressed in a logical flow that demonstrates the mechanisms described by the authors.

      To begin, we have concerns regarding the sporulation assays and their results. The data should be presented as "Percent sporulation" or "Sporulation (%)", not as a "sporulation rate": there is no kinetic element to any of these measurements, so no rate is being measured (be careful of this in the text as well, for instance near line 204). More importantly, no data are provided to show that changes in percent spores are not simply due to the death of non-sporulated cells. For example, imagine that within a population of B. subtilis cells, 85% of the cells are vegetative and 15% are spores. If, upon exposure to tse1, a large proportion of the vegetative cells are killed (say, 80% of them), this could lead to an apparent increase in sporulation: from 15% for the untreated population to ~50% of the treated, but the difference would be entirely due to a change in the vegetative population, not due to a change in sporulation. The authors need to clearly describe how they conducted their sporulation assays (currently there is no information about this in the methods) as well as provide the raw data of the counts of vegetative cells for their assays to eliminate this concern.
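      The reviewer's hypothetical can be made concrete with a few lines of arithmetic. The Python sketch below is illustrative only (the function name is ours, not the manuscript's) and assumes that spores are untouched by the treatment while a fraction of the vegetative cells dies. Killing 80% of the vegetative cells raises the apparent spore fraction from 15% to roughly 47% without a single new spore being formed, which is why total CFU counts are needed alongside spore counts.

```python
def apparent_sporulation_percent(vegetative: float, spores: float,
                                 vegetative_killed: float = 0.0) -> float:
    """Percent spores in the surviving population.

    Assumes spores are unaffected by the treatment and only a fraction
    `vegetative_killed` (0-1) of the vegetative cells dies.
    """
    surviving_vegetative = vegetative * (1.0 - vegetative_killed)
    return 100.0 * spores / (surviving_vegetative + spores)


# The reviewer's hypothetical population: 85 vegetative cells, 15 spores.
untreated = apparent_sporulation_percent(85, 15)       # 15.0
treated = apparent_sporulation_percent(85, 15, 0.80)   # about 46.9, yet no new spores formed
print(f"untreated: {untreated:.1f}%  treated: {treated:.1f}%")
```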

      Thanks for the suggestion. We have replaced all titles and data presented as “sporulation rate” with “sporulation (%)” or “sporulation percentage”. As also suggested by reviewer 2, we have included the raw CFU counts of the total population and of sporulated cells to show that there is no substantial change in the rate of death. We have also added a section in Materials and Methods specifying how the sporulation assays were done. Quote text:

      “Sporulation assays

      Spots of bacteria were resuspended in 1 mL sterile distilled water. Then, serial dilutions were made and plated on LB solid medium for vegetative cells CFU counts. The same serial dilutions were further heated at 80 ºC for 10 minutes to kill vegetative cells and immediately plated again on LB solid medium. Plates were grown overnight at 28 ºC and the resulting colonies were counted to calculate the percentage of Bsub sporulation (%). A list of raw CFUs (total and spore population) from all figures with sporulation percentage is shown in Supplementary file 3.”
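      The percentage defined in this protocol can be computed from the two platings of the same dilution series. The sketch below is a minimal illustration, not the authors' script. It assumes that unheated plates count the total population (vegetative cells plus spores, consistent with the "total and spore population" wording above) and that heated plates count spores only; the plated volume and colony numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PlateCount:
    colonies: int     # colonies counted on one LB plate
    dilution: float   # dilution of the plated tube, e.g. 1e-5
    volume_ml: float  # volume spread on the plate (hypothetical; not given in the methods)

    def cfu_per_ml(self) -> float:
        # Back-calculate CFU per mL of the original 1 mL resuspension.
        return self.colonies / (self.dilution * self.volume_ml)


def sporulation_percent(unheated: PlateCount, heated: PlateCount) -> float:
    """Spores (colonies surviving 80 ºC for 10 min) as a percentage of total CFU."""
    return 100.0 * heated.cfu_per_ml() / unheated.cfu_per_ml()


# Hypothetical counts from the 10^-5 dilution, 100 µL plated.
total = PlateCount(colonies=120, dilution=1e-5, volume_ml=0.1)
spores = PlateCount(colonies=36, dilution=1e-5, volume_ml=0.1)
print(f"{sporulation_percent(total, spores):.1f}%")  # 30.0%
```

      When the unheated and heated counts come from different dilutions, the back-calculation to CFU/mL is what keeps the two counts comparable.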

      A related concern is regarding the analysis of the kinases and the effects of their deletions on the impact of Tse1. Previous literature shows that the basal levels of sporulation in a B. subtilis kinA or a kinB mutant are severely defective relative to a wild-type strain; these mutants sporulate poorly on their own. Therefore, the data presented on Lines 394+ and the associated Supplemental Figure regarding the sporulation defects of these two mutants are not compelling for showing that these kinases are required for this effector to act. It is likely that simply missing these kinases would severely impact the ability of these strains to sporulate at all, irrespective of the presence of Tse1, and this confounding concern is not discussed.

      Previous literature shows that mutation of kinases affects sporulation of B. subtilis. The histidine kinases KinA and KinB are the main kinases responsible for initiating the sporulation cascade by phosphorylating Spo0F. However, as shown in Figure 6-figure supplement 1A, single mutants in these kinases (ΔkinA, ΔkinB) still sporulate, because the phosphorylation cascade is controlled by numerous intermediaries and other histidine kinases that form a multicomponent phosphorelay (KinA-E). In this context, sporulation of B. subtilis can also be triggered by KinC or KinD in the absence of KinA or KinB, as KinC/KinD can act directly on the master regulator of sporulation Spo0A (Burbulys et al., 1991; Wang et al., 2017).

      In addition, as suggested by reviewer 1, we have added to Figure 6-figure supplement 1A of the new version of the manuscript the 'no competitor' sporulation percentage control for each kinase mutant and for B. subtilis WT. The results show that, as commented by the reviewer and as supported by the literature, these mutants sporulate poorly on their own in the absence of an attacker (none). However, as shown in the figure, all kinase mutants increase their sporulation percentage in the presence of a competitor.

      Another concern is regarding the statistical tests used in Figure 2. For the statistical tests in A, B, and D, it should be stated whether a post-test was used to correct for multiple comparisons and, if so, which one. For C, to provide a stronger control comparison, we suggest the inclusion of a mock control in addition to the two conditions already included (i.e., an extraction from an E. coli strain expressing the empty vector).

      We have clarified the statistical tests used in Figure 2. Briefly, we used one-way ANOVA followed by Dunnett's test in Figure 2A, B and D for the statistical analysis of the sporulation percentage, with Bsub in competition with Pchl as the control group. Regarding Figure 2C, it is not possible to add a mock control with a strain carrying the empty vector, because this is a suicide plasmid (pDEST17) unable to replicate in E. coli without chromosome integration.
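      For readers less familiar with this design: the one-way ANOVA tests whether any group mean differs, and Dunnett's post-test then compares each group only with the control, controlling the family-wise error rate for those comparisons. The pure-Python sketch below computes only the omnibus F statistic, with hypothetical sporulation values. For the post-test itself, SciPy 1.11 and later provide `scipy.stats.dunnett(*samples, control=...)`; this is not necessarily the software the authors used.

```python
from statistics import fmean


def one_way_anova_f(groups: list[list[float]]) -> tuple[float, int, int]:
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    values = [x for g in groups for x in g]
    grand_mean = fmean(values)
    k, n = len(groups), len(values)
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of replicates around their own group mean.
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical sporulation percentages (three replicates per condition).
control = [55.0, 60.0, 58.0]   # e.g. Bsub vs. Pchl, the control group
mutant_a = [35.0, 38.0, 33.0]
mutant_b = [36.0, 31.0, 40.0]
print(one_way_anova_f([control, mutant_a, mutant_b]))
```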

      An additional concern regarding controls is that there is an absence of loading controls for the immunoblot assays. In Figure 5D and all immunoblot assays, there is no mention of a loading control, which is a critical control that should be included.

      In the previous version of the manuscript, we already included a loading control for Figure 5D in Figure supplement 7B, both for cell and for supernatant fractions. In the new version of the manuscript, the loading control of Figure 6E (Figure 5D in the previous version of the manuscript) is shown in Figure 6-figure supplement 2C. We have also included the original unedited gels and blots (Figure 6-figure supplement 2- source data 1 and Figure 6-figure supplement 2-source data 2).

      Some of the visualizations could be improved to help the reader understand and appropriately interpret the data presented. For instance, in Figures 3 and 4 the scale bars differ across the imaging panels within each figure. These should be scaled consistently for better comparison. Additionally, the red false colorization makes the printed images difficult to see. Black-and-white would be easier to see and would not detract from the images.

      The reviewer is right. The scale bars in Figure 3A equal 2, but the lengths of the bars were not the same. We have edited the images to have the same magnification for better comparison.

      In relation to Figure 4, we have changed the magnifications and now all the figures have the same scale bars and magnifications. In addition, we have added more images of broader fields in Figure 4-figure supplement 1 which were used to measure the percentage of permeabilized cells and to obtain the fluorescence intensity measures shown in Figure 4.

      An additional weakness of the paper is that the RNA-seq data are not fully investigated, and there is an absence of methods regarding the RNA-seq differential abundance analysis (it is mentioned on L379-380, but no information is provided in the methods). As stated by the authors, 58% of differentially regulated genes belonged to the σW regulon, but the other 42% of genes are not discussed; these will hopefully be a target of future investigations.

      The methods section has been modified for a better explanation of the RNA-seq differential abundance analysis. Quote text: “The raw reads were pre-processed with SeqTrimNext (Falgueras et al., 2010) using the specific NGS technology configuration parameters. This pre-processing removes low-quality, ambiguous and low-complexity stretches, linkers, adapters, vector fragments, and contaminated sequences while keeping the longest informative parts of the reads. SeqTrimNext also discarded sequences below 25 bp. Subsequently, clean reads were aligned and annotated using the Bsub reference genome with Bowtie2 (Langmead and Salzberg, 2012) in BAM files, which were then sorted and indexed using SAMtools v1.4 (Li et al., 2009). Uniquely localized reads were used to calculate the read number value for each gene via Sam2counts (https://github.com/vsbuffalo/sam2counts). Differentially expressed genes (DEGs) were analyzed via DEgenes Hunter, which provides a combined p value calculated (based on Fisher’s method) using the nominal p values provided by edgeR (Robinson et al., 2010) and DEseq2. This combined p value was adjusted using the Benjamini-Hochberg (BH) procedure (false discovery rate approach) and used to rank all the obtained DEGs. For each gene, combined p value < 0.05 and log2-fold change > 1 or < −1 were considered as the significance threshold”
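      The thresholding logic described above can be sketched as follows. This is a minimal pure-Python illustration, not the DEgenes Hunter implementation: it combines the edgeR and DESeq2 p values for each gene with Fisher's method, applies the Benjamini-Hochberg adjustment, and keeps genes with |log2FC| > 1. The text does not say whether the 0.05 cutoff applies to the raw or the BH-adjusted combined p value, so the sketch assumes the adjusted one; gene names and values are hypothetical.

```python
import math


def fisher_combined_p(pvalues: list[float]) -> float:
    """Fisher's method: X = -2 * sum(ln p) follows chi-square with 2k df under H0."""
    k = len(pvalues)
    half_x = -sum(math.log(p) for p in pvalues)  # X / 2
    # For even df (2k) the chi-square survival function has a closed form:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    return math.exp(-half_x) * sum(half_x ** i / math.factorial(i) for i in range(k))


def benjamini_hochberg(pvalues: list[float]) -> list[float]:
    """BH step-up adjusted p values (FDR), returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_degs(genes, alpha=0.05, min_abs_log2fc=1.0):
    """genes: iterable of (name, p_edger, p_deseq2, log2fc) tuples."""
    genes = list(genes)
    combined = [fisher_combined_p([p_e, p_d]) for _, p_e, p_d, _ in genes]
    adjusted = benjamini_hochberg(combined)
    return [name for (name, _, _, lfc), q in zip(genes, adjusted)
            if q < alpha and abs(lfc) > min_abs_log2fc]


genes = [("geneA", 1e-6, 1e-5, 2.3),   # hypothetical: strongly up-regulated
         ("geneB", 0.20, 0.30, 1.5),   # large fold change, not significant
         ("geneC", 1e-4, 1e-4, 0.4)]   # significant, small fold change
print(call_degs(genes))  # ['geneA']
```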

      Regarding the RNA-seq analysis, we are aware of the amount of information that can be extracted. Before filtering the information shown in the manuscript, we performed bioinformatic analyses to find a connection with the cellular response, that is, the increase in sporulation. Beyond this, we made some observations with no direct connection to sporulation, which would be interesting to pursue in future studies but would detract from the clarity of this story (Figure 2 below). In any case, we include here the whole picture of the transcriptomic changes occurring in Bsub after treatment with Tse1. KEGG pathway analysis of the differentially expressed genes showed induction of flagellar assembly, aminobenzoate degradation, and nitrogen and amino acid metabolism. Interestingly, fatty acid degradation and cationic antimicrobial peptide (CAMP) resistance pathways were also induced, probably in relation to the changes in the cell wall caused by the Tse1 toxin. On the other hand, the synthesis and degradation of ketone bodies pathway was mostly repressed.

      Figure 2. KEGG pathway analysis of differentially expressed genes in Bsub after treatment with Tse1.

      Another methodological concern in this paper is the limited details provided for the calculation of the permeabilization rate (Figure 4, L359, L662-664). It is not clear how, or if, cell density was controlled for in these experiments.

      We agree with the reviewer and have explained in more detail how the permeabilization rate was calculated. Quote text: “N=3 for Bsub treated with Tse1 and N=3 for untreated Bsub. N refers to the number of CLSM fields analyzed to calculate the proportion of permeabilized cells among all cells in the field”
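      Expressing permeabilization as a percentage within each field largely removes the dependence on how many cells happen to be in that field, which bears on the reviewer's cell-density question. The sketch below assumes that each CLSM field contributes a (permeabilized, total) pair of counts. Whether the authors averaged per-field percentages or pooled counts across fields is not stated, so both are computed; the counts are hypothetical.

```python
from statistics import fmean, stdev


def per_field_percentages(fields: list[tuple[int, int]]) -> list[float]:
    """fields: (permeabilized cells, total cells) for each CLSM field."""
    return [100.0 * permeabilized / total for permeabilized, total in fields]


def summarize(fields: list[tuple[int, int]]) -> dict[str, float]:
    per_field = per_field_percentages(fields)
    return {
        "mean_of_fields": fmean(per_field),   # each field weighted equally
        "sd_of_fields": stdev(per_field),     # sample SD across the N fields
        # Pooled estimate: fields with more cells weigh more.
        "pooled": 100.0 * sum(p for p, _ in fields) / sum(t for _, t in fields),
    }


tse1_treated = [(42, 150), (55, 180), (38, 140)]  # hypothetical N = 3 fields
print(summarize(tse1_treated))
```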

      Finally, one weakness of the paper is the broad conclusions that they draw. The authors claim that the mechanism of sporulation activation is conserved across Bacilli when the authors only test one B. subtilis and one B. cereus strain. They further argue (lines 469+) that Tse1 requires a PAAR repeat for its targeting, but do not provide direct evidence for this possibility.

      We have toned down the final conclusion to specify that the activation of sporulation is a mechanism that can be found in different Bacillus species such as Bsub and Bcer. Regarding the second point, we have included a further explanation for this argument. Quote text: “As shown in Figure 2B, a paar mutant has an active T6SS able to kill E. coli. However, as shown in Figure 2A, we noticed that a paar mutant (which still encodes tse1) is not able to trigger B. subtilis sporulation to the same level as the Pchl WT strain. Given that paar deletion apparently abolishes Tse1 secretion, we suggest that Tse1 is a PAAR-associated effector that requires a PAAR repeat domain protein to be targeted for secretion, thereby increasing Bacillus sporulation during contact with Pseudomonas cells (Cianfanelli et al., 2016; Hachani et al., 2014; Whitney et al., 2014)”.

    1. Author Response

      Reviewer #1 (Public Review):

      In this manuscript, Elkind et al. use a deep learning segmentation algorithm trained on detecting putative cell nuclei in mouse brains to count cells in the Allen Mouse Brain Connectivity Atlas. The Allen Mouse Brain Connectivity Atlas is a dataset comprising hundreds of mouse brains. The authors use this increased statistical power to detect differences in volume, cell count, and cell density between strains (C57BL/6J and FVB.CD1) as well as sex differences.

      Volume, cell count, and cell density are all regularly used in neuroanatomy to normalize or benchmark results, so having a large available dataset for others to compare their data against would be a useful resource. The trained segmentation algorithm might also find utility in assays where investigators, for one reason or another, cannot dedicate an entire labeled channel to counting cell nuclei.

      Nevertheless, for technical reasons, I find the current work problematic.

      We thank the Reviewer for acknowledging potential usefulness of our work, and the insightful, helpful comments. We believe this consideration has made our revised manuscript much stronger compared to the initial submission. We hope our revised version will also clear the Reviewer’s remaining doubts.

      Major:

      The authors make use of the "red" channel from the Allen Mouse Brain Connectivity Project (AMBCP). The AMBCP was acquired using two-photon tomography with the TissueCyte 1000 system (http://help.brain-map.org/download/attachments/2818171/Connectivity_Overview.pdf?version=2&modificationDate=1489022310670&api=v2). The sample is illuminated at 925 nm wavelength and the channel the authors describe as autofluorescence is collected through a 593/40 nm bandpass filter. The authors go on to describe their rationale for using this channel for quantifying cell nuclei:

      "We noticed that the red (background) channel of STPT images, taken for the purpose of atlas alignment, typically features dark, round-like objects resembling cell nuclei. We had observed this phenomenon in our own imaging of mouse brains but found little more than anecdotal mentions of it in the literature8,9,10,11".

      The authors here cite a Scientific Reports paper from 2021 with 11 citations, a Journal of Clinical Pathology paper from 2005 with 87 citations, and lastly a paper in Laboratory Investigation from 2016 with 41 citations. The authors completely fail to cite the work from Watt Webb's group (co-inventor of 2p microscopy) in PNAS from 2003 that comprehensively described the phenomenon of native fluorescence by multiphoton excitation (https://www.pnas.org/doi/10.1073/pnas.0832308100; 1959 citations so far). This is either indicative of poor scholarship or an attempt to describe something as novel. Either way, the native fluorescence and second harmonic generation from multiphoton illumination are perfectly characterized by Webb and colleagues, and they clearly show the differential effect on nucleosides, retinol, indoleamines, and collagen. This is also where the authors should have paid more attention to discrepancies in their own data when correlated to well-established cell nuclei markers (Murakami et al.). The authors will note "black large spots" in the data at specific anatomical regions and structures, like the fornix and stria medullaris: https://connectivity.brain-map.org/projection/experiment/siv/263780729?imageId=263780960&imageType=TWO_PHOTON,SEGMENTATION&initImage=TWO_PHOTON&x=15702&y=18833&z=5

      which is not reproduced in, for example, the Allen Reference Atlas H&E staining: http://atlas.brain-map.org/atlas?atlas=1&plate=100960284#atlas=1&plate=100960284&resolution=4.19&x=5507.4000244140625&y=5903.39990234375&zoom=-2

      Relatedly, note the poor signal in the 2p "autofluorescence" within the paraventricular nucleus: https://connectivity.brain-map.org/projection/experiment/siv/263780729?imageId=263780960&imageType=TWO_PHOTON,SEGMENTATION&initImage=TWO_PHOTON&x=15702&y=17833&z=6

      and then compare it to the H&E staining: http://atlas.brain-map.org/atlas?atlas=1&plate=100960280#atlas=1&plate=100960276&resolution=1.50&x=5342.476283482143&y=5368.023856026786&zoom=0

      These multiphoton-specific signals are especially pronounced in the pons and medulla, which makes quantification there especially dubious, as is apparent simply from looking at Figure 1c in the manuscript.

      We thank the Reviewer for the comments and sincerely apologize for missing the seminal work of Webb’s group. We had included the former references because they specifically mention or illustrate non-autofluorescent nuclei, but we entirely failed to address the underlying chemistry that Webb’s group beautifully characterized. We have added the following sentence in the Results section “Autofluorescence of STPT images displays cell nuclei” (red font for new sentence; Reference #15 corresponds to Zipfel et al.):

      “We noticed that the red (background) channel of STPT images, taken for the purpose of atlas alignment, typically features dark, round-like objects resembling cell nuclei. This phenomenon was described in previous literature [11-14]. In particular, Zipfel et al. characterized the use of multiphoton-excited native fluorescence and second harmonic generation for the purpose of staining-free tissue imaging [15].”

      And mentioned the dependency of our method on the presence of intrinsically fluorescent molecules in the Discussion:

      “The study has several limitations. First, the model is sensitive to the contrast between dark nuclei and autofluorescent surroundings, which can be limited by image quality and tissue composition. In particular, the staining-free approach depends on the presence of intrinsic molecular indicators such as NADH, retinol or collagen [15], which may vary between cell or tissue components, even within the brain.”

      We understand that, more generally, the Reviewer’s major concern above was the technical validity of our approach: whether the segmented small objects lacking autofluorescence in the STPT dataset in fact correspond to cells/nuclei.

      In our initial Supplemental Figure 1 (in current version Figure 1—figure supplement 1) we provide technical validation of the method, by showing nuclear staining, and autofluorescence side-by-side, using epifluorescence microscopy. In our revision we now report appropriate statistical measures for this analysis (true positives, false positives, false negatives).
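      True-positive, false-positive and false-negative counts for such a side-by-side comparison can be obtained by matching each detected object to a stained nucleus. The sketch below shows one simple way to do this, under stated assumptions: objects are reduced to 2D centroids, matching is greedy and one-to-one within a hypothetical distance tolerance `max_dist`, and the coordinates are made up. It is not necessarily the matching rule used in the manuscript; an optimal assignment (for example SciPy's `linear_sum_assignment`) could replace the greedy loop.

```python
import math

Point = tuple[float, float]


def match_detections(predicted: list[Point], reference: list[Point],
                     max_dist: float) -> tuple[int, int, int]:
    """Greedy one-to-one matching of predicted to reference centroids.

    Returns (true positives, false positives, false negatives).
    """
    unmatched = list(reference)
    tp = 0
    for px, py in predicted:
        best_j, best_d = None, max_dist
        for j, (rx, ry) in enumerate(unmatched):
            d = math.hypot(px - rx, py - ry)
            if d <= best_d:
                best_j, best_d = j, d
        if best_j is not None:
            unmatched.pop(best_j)  # each reference nucleus can be matched only once
            tp += 1
    return tp, len(predicted) - tp, len(unmatched)


def detection_scores(tp: int, fp: int, fn: int) -> dict[str, float]:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {"precision": precision, "recall": recall,
            "f1": 2 * precision * recall / (precision + recall)}


dnn = [(0.0, 0.0), (10.0, 10.0), (50.0, 50.0)]        # autofluorescence-based detections
hoechst = [(1.0, 0.0), (10.0, 12.0), (100.0, 100.0)]  # stained nuclei
counts = match_detections(dnn, hoechst, max_dist=3.0)
print(counts, detection_scores(*counts))  # (2, 1, 1) and 2/3 for each score
```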

      In addition, we performed the following two sets of validations –

      (i) Technical validation of our staining-free quantification approach by nuclear staining. We performed nuclear staining (Hoechst 33342) followed by STPT imaging of 9 female brains and trained a new deep neural network (DNN) to segment the resulting images (STPT was performed by TissueVision). Unfortunately, STPT does not allow nuclear staining and autofluorescence to be analyzed in the very same tissue. Therefore, we compared per-region density, cell count and volume of the nuclei-stained validation brains with our original DNN-based analysis of AMBCA brains. We find a correlation coefficient >0.99 for per-region cell count between AMBCA autofluorescence and our nuclear staining (and a similar correlation coefficient for volume). However, over the whole brain, nuclear staining detects 56% more cells than autofluorescence. Although we currently have no technically feasible way to prove this, one likely explanation for this discrepancy is the different nature of the two signals: a positive signal from the Hoechst fluorophore versus the absence of autofluorescence. Further, discrepancies between the two methods were notably higher in glia-rich tissues (e.g., CTX L1, midbrain, brainstem), leading to the speculation that counts of low-autofluorescence objects may be biased toward neurons rather than glia.

      (ii) Independent validation of the biological findings, discussed further below. Regarding the specific concern of “black large spots” in the fornix and stria medullaris, we would like to emphasize that our DNN does not identify and segment dark regions such as ventricles and tracts. In Author Response Image 1 we provide three examples featuring “black large spots” of different shapes and sizes, with examples of the segmentation results as shown in Figures 1 and 2 of the manuscript. Note that colored circles, which appear as dots depending on magnification, are the objects that were detected and segmented by the DNN. These examples demonstrate that (1) fiber tracts (incl. fornix, stria medullaris) are not segmented; (2) striatal patches (which are smaller still than the fiber tracts in question) are not segmented; and (3) putative blood vessels, appearing as elongated, black structures, are ignored by our DNN.

      Author Response Image 1. How does the DNN deal with large black spots? Examples of fiber tracts, striatal patches, and blood vessels; adapted from Figures 1 and 2 in the manuscript. Note that dots/outlines represent segmented putative “nuclei” as detected by the model, colored by assigned region according to the Allen Mouse Brain hierarchy. Example (1): fiber tracts (incl. fornix, stria medullaris) are not segmented. Example (2): striosomes (patches in the striatum, which are smaller still than the fiber tracts in question) are not segmented; the much smaller objects that are detected as putative nuclei are indicated by arrows. Example (3): putative blood vessels, appearing as elongated, black structures, are ignored by our DNN. The segmentation images were adapted from the manuscript’s Figure 1 to correspond to the STPT image featuring fiber tracts (and striosomes/patches) that was pointed out by the Reviewer.

      Retrieved from: https://connectivity.brain-map.org/projection/experiment/siv/263780729?imageId=263780960&imageType=TWO_PHOTON,SEGMENTATION&initImage=TWO_PHOTON&x=15702&y=18833&z=5.

      Regarding the claim of problematic counting in brainstem regions, we agree, and had addressed this limitation in the manuscript’s Discussion (see below). We believe that our counting is valuable even if there is a significant systematic error in some regions: most of the analyses in this study compare brain regions across individuals, so systematic error is less impactful. In the revision, we nevertheless took care to validate and quantify the size of this effect. Briefly, we compared counting based on nuclear staining (Hoechst) from 9 STPT-imaged brains with our quantifications of non-autofluorescent objects. As expected, the ratio between these counts depends on the brain region, and accuracy is better in bright regions that are not on the border of the section (Figure 2—figure supplement 2). For the pons and medulla, the densities in our Hoechst quantifications are 43% and 60% higher than in our AMBCA analysis, respectively, yet the rank order is preserved in both.
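      Note that "X% higher" in the Hoechst data is not the same as "X% lower" in the autofluorescence data: the autofluorescence estimate is 1/(1 + X/100) of the nuclear-staining value. The short calculation below applies this to the pons and medulla figures. How the 66% figure in the revised Discussion was derived is not stated; averaging the two regional fractions, as done here, is only one calculation consistent with it.

```python
def fraction_of_reference(percent_higher: float) -> float:
    """If the reference is `percent_higher`% above our estimate, ours = ref / (1 + p/100)."""
    return 1.0 / (1.0 + percent_higher / 100.0)


pons = fraction_of_reference(43)     # about 0.70
medulla = fraction_of_reference(60)  # 0.625
print(f"pons {pons:.1%}, medulla {medulla:.1%}, mean {(pons + medulla) / 2:.1%}")
```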

      We have revised the relevant sentences in the Discussion:

      Original sentences: The study has several limitations. … In the hindbrain (pons, medulla), contrast was exceedingly weak, and we expect our quantifications in this region to strongly underestimate real cell densities, to an extent we cannot quantify.

      Revised sentences: The study has several limitations. … In the hindbrain (pons, medulla), contrast was exceedingly weak, and we expect our quantifications in this region to be 66% of the value estimated by nuclear staining (Figure 2—figure supplement 2).

      The authors here use the correlation on log-log coordinates between their data and that of Murakami et al. to argue that the method has validity. However, the variance explained here is R^2 = 0.74, which is very poor given the log-log coordinates. A more valid metric would use linear coordinates, compute the ICC, and interpret it according to established guidelines (e.g. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4913118/).

      As mentioned by the Reviewer, Figure 2D compares the Murakami et al. cell counts with ours across all brain regions. The value “r=0.869” is the correlation coefficient between the two vectors on a log scale, not the R^2. We now also display the correlation coefficient on the linear scale, which is ρ=0.98. As suggested by the Reviewer, we added ICC values between the two vectors on the linear scale. Using 6 different forms (ICC – 1-1;1-k;C-1;C-k;A-1;A-k), the ICC values were 0.98-0.99, corresponding to excellent agreement (ICC values are given in the legend of Figure 2).

      Author Response Image 2 displays the revised Figure 2D (left), and the log value of the ratio between the AMBCA-based cell count and the Murakami-based value (right), as a function of region volume. The mean value across regions is zero, corresponding to similar cell counts in both methods. Indeed, there exist outlier regions, that may be attributed to either registration errors, different experimental protocols or may stem from the fact that the Murakami values are based on 3 brains, compared to hundreds of AMBCA brains.

      Author Response Image 2. Correlation with cell counts in Murakami et al. Left, revised Figure 2D; Right, ratio between AMBCA-based cell counts and Murakami et al. counts, as a function of region volume
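      The right-hand panel of Author Response Image 2 can be reproduced in outline as follows. This sketch assumes a base-10 logarithm (the base is not stated) and uses hypothetical region names, volumes and counts. A mean log ratio near zero corresponds to similar counts from both sources, while individual regions far from zero are the outliers discussed above.

```python
import math
from statistics import fmean


def log_ratio_by_volume(regions: dict[str, tuple[float, float, float]]):
    """regions: name -> (volume, ambca_count, murakami_count).

    Returns (volume, log10 ratio, name) tuples sorted by volume, plus the mean log ratio.
    """
    points = sorted(
        (vol, math.log10(ambca / murakami), name)
        for name, (vol, ambca, murakami) in regions.items()
    )
    return points, fmean(lr for _, lr, _ in points)


example = {  # hypothetical values; volume units arbitrary
    "regionA": (0.5, 9000.0, 10000.0),
    "regionB": (2.0, 52000.0, 50000.0),
    "regionC": (8.0, 300000.0, 240000.0),
}
points, mean_lr = log_ratio_by_volume(example)
for vol, lr, name in points:
    print(f"{name}: volume {vol}, log10 ratio {lr:+.3f}")
print(f"mean log10 ratio: {mean_lr:+.3f}")
```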

      In addition to the above concern, the authors argue that the large sample size of the AMBCP is what would enable them to find statistically significant small effect sizes that might have gone undetected in the literature. However, this argument falls flat once we examine some of the main findings the authors report. Although the authors do not directly report measures of dispersion, we can estimate it from the figures and then arrive at the sample size needed to find the reported effect size. For example, the effect that describes ORBvl2/3 volume is larger in female mice compared to males would only require n=13 mice at the desired power of 0.8. Likewise, the sample size needed to detect the increased BST volume in male mice looks to be roughly n=16 mice at the desired power of 0.8. Both of these estimates are well within what is a reasonable sample size to expect in an ordinary study. This raises the question: why did the authors not simply verify some of their main findings in an independent sample obtained through traditional ways to quantify volume and cell density, since it is well within reach? Such validation would strengthen the arguments of the paper.
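      The reviewer's back-of-the-envelope estimate corresponds to a standard power calculation for comparing two group means. A minimal sketch, using the normal approximation n per group = 2 * (z_(1-alpha/2) + z_(1-beta))^2 / d^2, where d is Cohen's d (difference in means divided by the pooled SD). The exact t-based calculation gives slightly larger numbers, and the effect sizes below are hypothetical rather than read from the manuscript's figures.

```python
import math
from statistics import NormalDist


def n_per_group(cohens_d: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate sample size per group for a two-sided, two-sample comparison."""
    z = NormalDist()  # standard normal
    z_alpha = z.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)           # about 0.84 for power = 0.8
    return math.ceil(2 * ((z_alpha + z_beta) / cohens_d) ** 2)


for d in (0.5, 1.0, 1.5):  # hypothetical effect sizes
    print(d, n_per_group(d))
```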

      We thank the reviewer for this comment and apologize for the omission. In the revised version we now report dispersion.

      We would like to emphasize that due to our restricted time and resources, we decided to focus our experimental validation on the technical comparison of nuclear staining vs. autofluorescence-based segmentation, outlined above.

      We then verified the biological findings from the initial cohort using C57BL/6J volume data from an additional 663 males vs. 166 females on AMBCA. This independent cohort showed similar sexual dimorphism in the volume of MEA, BST and ORBvl2/3, as depicted in the following figure (panels A-D and also as new Figure 4—figure supplement 1).

      We fully acknowledge the interesting issue raised regarding the sample sizes required to detect our reported effect sizes. Therefore, we here also present the average p-value for sexual dimorphism in the volumes of MEA, BST and ORBvl2/3 as a function of sample size (panel E in Figure 4—figure supplement 1 of the revised manuscript). The Reviewer will note that the regions with the largest effect size (MEA, BST) can be detected with more ordinary sample sizes, and indeed, MEA and BST dimorphism is evident in the literature. ORB dimorphism required a much larger sample size, and our analysis (Figure 4) systematically detected many more dimorphic regions, in volume, density and count.
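      A curve of mean p value against sample size can be produced by repeatedly drawing random subsets of each group and testing each draw. The sketch below is illustrative only. The statistical test behind panel E is not specified here, so the sketch uses Welch's t statistic with a normal approximation to the p value (which is anticonservative for very small n), and the example groups are simulated, not AMBCA data.

```python
import math
import random
from statistics import NormalDist, fmean, variance


def approx_welch_p(a: list[float], b: list[float]) -> float:
    """Two-sided p value for Welch's t statistic via a normal approximation."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    t = (fmean(a) - fmean(b)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


def mean_p_by_sample_size(group_a, group_b, sizes, draws=200, seed=0):
    """Average p value over `draws` random subsamples of n animals per group."""
    rng = random.Random(seed)
    return {
        n: fmean(approx_welch_p(rng.sample(group_a, n), rng.sample(group_b, n))
                 for _ in range(draws))
        for n in sizes
    }


# Simulated volumes with a one-SD difference between groups (hypothetical).
sim = random.Random(1)
males = [sim.gauss(1.0, 0.1) for _ in range(600)]
females = [sim.gauss(1.1, 0.1) for _ in range(160)]
print(mean_p_by_sample_size(males, females, sizes=[5, 10, 20, 40]))
```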

      Reviewer #2 (Public Review):

      This report describes a large-scale analysis of cell counts in mouse brains. The authors found that the Allen Mouse Connectivity project has a rich dataset for cell counting that is yet to be analyzed, and they developed methods to quantify cells in different nuclei. They go on to compare males vs females and two different strains. From this analysis, they found specific differences between male versus female brains, left versus right hemispheres, and C57BL/6 versus FVB.CD1 mice, especially with regard to cell counts and density.

      Overall, the methodology is sound and the quality of the data seems high. In fact, this study uses >100 brains for the statistics, and this is one of the major strengths of this study. For researchers who are interested in interrogating the differences at the macroscopic level in brain structures, this study will be a great resource. For example, the manuscript contains an interesting finding that for most brain areas, females have larger volumes but fewer cell numbers.

      We thank the Reviewer for these comments. We would like to mention that the revised version of the manuscript does not include a statement regarding BL6 female volume. We found a batch effect in the AMBCA experiments, mostly affecting the volume in their first batch (Figure 2—figure supplement 1B). That batch included mostly males and had, for some reason, lower volume compared to all later experiments, which caused the volume differences. We emphasize that (1) the total number of cells did not show any batch effect (Figure 2—figure supplement 1C); (2) we normalized the volume and repeated the analysis. Aside from the finding that females did not in fact have larger volumes, the other main findings remained unchanged.

      Reviewer #3 (Public Review):

      Elkind et al. have devised a strategy to detect cells in whole brain samples of the large, publicly accessible Allen Mouse Brain Connectivity database. They put together an analysis pipeline to quantify cell numbers and density as well as volumes for all annotated brain areas in these samples. This allowed them to make several important discoveries such as (1) strain-, sex- and hemisphere-specific differences in cell densities, (2) a large interindividual variability in cell numbers, and (3) an absence of linear scaling of cell count with volume, among others. The key strength of this work lies in its comprehensive analysis, the large sample size that the authors have drawn from (making their conclusions particularly robust), and the fact that they have made their analysis tools accessible. A weakness of the current manuscript is the dense layout and overplotting of several of the figures, and the lack of necessary information to understand them more easily. Another, conceptual weakness of using the autofluorescence channel for cell detection is that the identity (neuronal vs non-neuronal) of the underlying cells remains unresolved. Overall, however, I believe that this study has the potential to serve as a valuable reference point, and I would expect this work to have a lasting impact on quantitative studies of mouse brain cytoarchitecture.

      We thank the Reviewer for these valuable comments. We have tried to minimize overplotting in the figures and have hopefully added all necessary information. For example, the revised manuscript presents more pared-down figures, with data labels omitted where they crowded the graphic. Instead, we provide the full data in supplementary tables and in our online GUI. We hope readers will feel encouraged to zoom into the presented data and to explore the additional tables and our online tool in more depth.

      Regarding the question of cell types, we were unfortunately not able to provide a definitive answer, but our validation experiments provided some potential clues. For example, nuclear staining (Hoechst) uniformly detected 65% more cells than AMBCA autofluorescence quantification. Moreover, the correspondence between nuclear staining and AMBCA autofluorescence was notably better in neuron-rich regions than in glia-rich regions (e.g., CTX L1, midbrain, medulla). These discrepancies between the techniques may therefore point to an underlying difference in cell-type composition, such that counting low-autofluorescence nuclei is biased toward neurons.

      In addition, however, the methods differ in their native physical properties: one detects the presence of a fluorescent signal (e.g., the nuclear stain is detected beyond its focal plane), whereas the other detects the absence of a signal (which, in turn, depends on the presence of surrounding intrinsically fluorescent molecules). It is technically non-trivial to assess the extent to which these factors apply. We have added a clarification along these lines to the Discussion (below). We would further like to emphasize that our study is a comparative, systematic analysis within this interesting cohort, rather than a source of definitive cell counts, which we found to vary greatly across the population.

      “We further attempted to estimate the region-specific accuracy of our cell counting by comparing autofluorescence STPT with brain-wide imaging of nuclear-stained STPT. However, this comparison is technically nontrivial because of the native physical properties of direct staining vs. autofluorescence. For example, stained nuclei located off the focal plane may appear in the image, yet remain undetected by autofluorescence. In addition, tissue composition (e.g., cell types, extracellular matrix) may affect the imaged region. Indeed, in regions rich in non-neuronal cells, the discrepancy between autofluorescence-based counting and nuclear staining was larger. Hence, one may speculate that autofluorescence-based detection is biased toward neurons.”

    1. Author Response:

      Reviewer #1 (Public Review):

      Chakrabarti et al study inner hair cell synapses using electron tomography of tissue rapidly frozen after optogenetic stimulation. Surprisingly, they find a nearly complete absence of docked vesicles at rest and after stimulation, but upon stimulation vesicles rapidly associate with the ribbon. Interestingly, no changes in vesicle size were found along or near the ribbon. This would have indicated a process of compound fusion prior to plasma membrane fusion, as proposed for retinal bipolar cell ribbons. This lack of compound fusion is used to argue against MVR at the IHC synapse. However, that is only one form of MVR. Another form, coordinated and rapid fusion of multiple docked vesicles at the bottom of the ribbon, is not ruled out. Therefore, I agree that the data set provides good evidence for rapid replenishment of the ribbon-associated vesicles, but I do not find the evidence against MVR convincing. The work provides fundamental insight into the mechanisms of sensory synapses.

      We thank the reviewer for the appreciation of our work and the constructive comments. As detailed below, we have now included this discussion (from line 679 onwards).

      We wrote:

      “This might reflect spontaneous univesicular release (UVR) via a dynamic fusion pore (i.e., ‘kiss and run’; Ceccarelli et al., 1979), which has previously been suggested for IHC ribbon synapses (Chapochnikov et al., 2014; Grabner and Moser, 2018; Huang and Moser, 2018; Takago et al., 2019), and/or rapid undocking of vesicles (e.g., Dinkelacker et al., 2000; He et al., 2017; Nagy et al., 2004; Smith et al., 1998). In the UVR framework, stimulation by ensuing Ca2+ influx triggers the statistically independent release of several SVs. Coordinated multivesicular release (MVR) has been suggested to occur at hair cell synapses (Glowatzki and Fuchs, 2002; Goutman and Glowatzki, 2007; Li et al., 2009) and retinal ribbon synapses (Hays et al., 2020; Mehta et al., 2013; Singer et al., 2004) during both spontaneous and evoked release. We could not observe any structures that might hint toward compound or cumulative fusion, either at the ribbon or at the AZ membrane, under our experimental conditions. Upon short and long stimulation, RA-SVs as well as docked SVs even showed a slightly reduced size compared to controls. However, since some AZs harbored more than one docked SV in stimulated conditions, we cannot fully exclude the possibility of coordinated release of a few SVs upon depolarization.”

      Reviewer #2 (Public Review):

      Chakrabarti et al. aimed to investigate exocytosis from ribbon synapses of cochlear inner hair cells with high-resolution electron microscopy with tomography. Current methods to capture the ultrastructure of the dynamics of synaptic vesicle release in IHCs rely on the application of potassium for stimulation, which constrains temporal resolution to minutes rather than the millisecond resolution required to analyse synaptic transmission. Here the authors implemented a high-pressure freezing method relying on optogenetics for stimulation (Opto-HPF), granting them both high spatial and temporal resolutions. They provide an extremely well-detailed and rigorously controlled description of the method, in line with previous "Opto-HPF" studies. They successfully applied Opto-HPF to IHCs and had several findings at this highly specialised ribbon synapse. They observed a stimulation-dependent accumulation of docked synaptic vesicles at IHC active zones, and a stimulation-dependent reduction in the distance of non-docked vesicles to the active zone membrane, while the total number of ribbon-associated vesicles remained unchanged. Finally, they did not observe increases in diameter of synaptic vesicles proximal to the active zone, or other potential correlates of compound fusion, a potential mode of multivesicular release. The conclusions of the paper are mostly well supported by data, but some aspects of their findings and pitfalls of the methods should be better discussed.

      We thank the reviewer for the appreciation of our work and the constructive comments.

      Strengths:

      While a few different groups have now used "Opto-HPF" methods (also referred to as "Flash and Freeze") in different ways and at different synapses, the current study implemented the method with rigorous controls in a novel way, specifically applied to cochlear IHCs, a different sample preparation from the neuronal cultures, brain slices or C. elegans used so far. The analysis of exocytosis dynamics of IHCs by electron microscopy with stimulation has so far been limited to the application of potassium, which is not physiological. While much has been learned from these methods, they lacked time resolution. With Opto-HPF the authors were able to investigate synaptic transmission with millisecond precision, with electron tomography analysis of active zones. I have no overall questions regarding the methodology as it was very thoroughly described. The authors also employed electrophysiology with optogenetics to characterise the optical stimulation parameters and provided a well-described analysis of the results with different pulse durations and irradiance, which is crucial for Opto-HPF.

      Thank you very much.

      Further, the authors did a superb job in providing several tables with data and information across all mouse lines used, experimental conditions, and statistical tests, including source code for the diverse analysis performed. The figures are overall clear and the manuscript was well written. Such a clear representation of data makes it easier to review the manuscript.

      Thank you very much.

      Weaknesses:

      There are two main points that I think need to be better discussed by the authors.

      The first refers to the pitfalls of using optogenetics to analyse synaptic transmission. While ChR2 provides better time resolution than potassium application, one cannot rule out the possibility that calcium influx through ChR2 alters neurotransmitter release. This important limitation of the technique should be properly acknowledged by the authors and its consequences discussed, specifically in the context in which they applied it: a single sustained pulse of light of ~20 ms (ShortStim) and of ~50 ms (LongStim). While longer, sustained stimulation is characteristic of IHCs, these are quite long pulses by optogenetic standards, with potential consequences for intrinsic or synaptic properties.

      We thank the reviewer for pointing this out. We note that upon 15 min of high-potassium depolarization, the number of docked SVs increased only slightly and not significantly, as shown in Chakrabarti et al., 2018 (EMBO Rep) and Kroll et al., 2020 (JCS). In the current study, we report a similar phenomenon, but here light-induced depolarization resulted in a more robust increase in the number of docked SVs.

      To compare the data from the previous studies with the current study, we have now included an additional Table 3 in the Discussion (line 676) listing the total counts (and averages per AZ) of docked SVs.

      Furthermore, in response to the reviewer’s concern, we now discuss the Ca2+ permeability of ChR2, in addition to the above comparison with our previous studies that demonstrated very few docked SVs in the absence of K+ channel blockers and ChR2 expression in IHCs. We are not entirely certain whether the reviewer refers to potential dark currents of ChR2 (e.g., as an explanation for a depletion of docked vesicles under non-stimulated conditions) or to photocurrents, i.e., the influx of Ca2+ through ChR2 itself and its contribution to the Ca2+ concentration at the active zone.

      However, regardless of this, we consider it unlikely that Ca2+ influx via ChR2 evokes SV fusion at the hair cell active zone.

      First of all, we note that the Ca2+ affinity of IHC exocytosis is very low. As first shown by Beutner et al. (2001) and confirmed thereafter (e.g., Pangrsic et al., 2010), there is little if any IHC exocytosis for Ca2+ concentrations at the release sites below 10 µM. Two studies using CatCh, a ChR2 mutant with higher Ca2+ permeability than wild-type ChR2 (Kleinlogel et al., 2011; Mager et al., 2017), estimated a maximal intracellular Ca2+ increase below 10 µM, even at very negative potentials that promote Ca2+ influx along the electrochemical gradient or at high extracellular Ca2+ concentrations of 90 mM. In our experiments, IHCs were instead depolarized to potentials for which extrapolation of the data of Mager et al. (2017) indicates a submicromolar Ca2+ concentration. In addition, we and others have demonstrated powerful Ca2+ buffering and extrusion in hair cells (e.g., Tucker and Fettiplace, 1995; Issa and Hudspeth, 1996; Frank et al., 2009; Pangrsic et al., 2015). As a result, hair cells efficiently clear even massive synaptic Ca2+ influx and maintain a low bulk cytosolic Ca2+ concentration (Beutner and Moser, 2001; Frank et al., 2009). We reason that these clearance mechanisms efficiently counter any Ca2+ influx through ChR2, which will likely limit potential effects of ChR2-mediated Ca2+ influx on Ca2+-dependent replenishment of synaptic vesicles during ongoing stimulation.

      We have now added the following to the Discussion (starting at line 620):

      “We note that ChR2, in addition to monovalent cations, also permeates Ca2+ ions, which raises the question of whether optogenetic stimulation of IHCs could trigger release via direct Ca2+ influx through ChR2. We consider it unlikely that such Ca2+ influx triggers exocytosis of synaptic vesicles in IHCs. Optogenetic stimulation of HEK293 cells overexpressing ChR2 (wild-type version) raises the intracellular Ca2+ concentration only up to 90 nM, even with an extracellular Ca2+ concentration of 90 mM (Kleinlogel et al., 2011). IHC exocytosis shows a low Ca2+ affinity (~70 µM; Beutner et al., 2001), and there is little if any IHC exocytosis for Ca2+ concentrations below 10 µM, a level well above what could be achieved even by the highly Ca2+-permeable ChR2 mutant CatCh (Ca2+ translocating channelrhodopsin; Mager et al., 2017). In addition, we reason that the powerful Ca2+ buffering and extrusion by hair cells (e.g., Frank et al., 2009; Issa and Hudspeth, 1996; Pangršič et al., 2015; Tucker and Fettiplace, 1995) will efficiently counter Ca2+ influx through ChR2 and thereby limit potential effects on Ca2+-dependent replenishment of synaptic vesicles during ongoing stimulation.”

      The second refers to the finding that the authors did not observe evidence of compound fusion (or homotypic fusion) in their data. This is an interesting finding in the context of multivesicular release in general, as well as specifically for IHCs. While the authors discussed the potential for "kiss-and-run" and/or "kiss-and-stay", it would be valuable if they could discuss their findings further in the context of the multivesicular release field, for example the evidence in support of multiple independent release events. Further, with such function-structure optical quick-freezing methods, it is not unusual to fail to capture fusion events (so-called omega shapes or vesicles with fusion pores); this is largely because these are very fast events (less than 10 ms) that are not easily captured with optical stimulation.

      We agree with the reviewer that the discussion on MVR and UVR should be extended. We have now added the following paragraph to the Discussion (from line 679 onwards):

      “This might reflect spontaneous univesicular release (UVR) via a dynamic fusion pore (i.e., ‘kiss and run’; Ceccarelli et al., 1979), which has previously been suggested for IHC ribbon synapses (Chapochnikov et al., 2014; Grabner and Moser, 2018; Huang and Moser, 2018; Takago et al., 2019), and/or rapid undocking of vesicles (e.g., Dinkelacker et al., 2000; He et al., 2017; Nagy et al., 2004; Smith et al., 1998). In the UVR framework, stimulation by ensuing Ca2+ influx triggers the statistically independent release of several SVs. Coordinated multivesicular release (MVR) has been suggested to occur at hair cell synapses (Glowatzki and Fuchs, 2002; Goutman and Glowatzki, 2007; Li et al., 2009) and retinal ribbon synapses (Hays et al., 2020; Mehta et al., 2013; Singer et al., 2004) during both spontaneous and evoked release. We could not observe any structures that might hint toward compound or cumulative fusion, either at the ribbon or at the AZ membrane, under our experimental conditions. Upon short and long stimulation, RA-SVs as well as docked SVs even showed a slightly reduced size compared to controls. However, since some AZs harbored more than one docked SV in stimulated conditions, we cannot fully exclude the possibility of coordinated release of a few SVs upon depolarization.”

      Reviewer #3 (Public Review):

      Precise methods were developed to validate the expression of channelrhodopsin in inner hair cells of the Organ of Corti, to quantify the relationship between blue light irradiance and auditory nerve fiber depolarization, to control light stimulation within the chamber of a high-pressure freezing device, and to measure with good precision the delay between stimulation and freezing of the specimen. These methods represent a clear advance over previous experimental designs used to study this synaptic system and are an initial application of rapid high-pressure freezing with freeze substitution, followed by high-resolution electron tomography (ET), to sensory cells that operate via graded potentials.

      Short-duration stimuli were used to assess the redistribution of vesicles among pools at hair cell ribbon synapses. The number of vesicles linked to the synaptic ribbon did not change, but vesicles redistributed within the membrane-proximal pool to docked locations. No evidence was found for vesicle-to-vesicle fusion prior to vesicle fusion to the membrane, which is an important, ongoing question for this synapse type. The data for quantifying numbers of vesicles in membrane-tethered, non-tethered, and docked vesicle pools are compelling and important.

      We thank the reviewer for the appreciation of our work and the constructive comments.

      These quantifications would benefit from additional presentation of raw images so that the reader can better assess their generality and variability across synaptic sites.

      The images shown for each of the two control and two experimental (stimulated) preparation classes should be more representative. Variation in synaptic cleft dimensions and in the numbers of ribbon-associated and membrane-proximal vesicles does not track the averaged data. Since the preparation has novel stimulus features, additional images (as the authors employed in previous publications) exhibiting tethered vesicles, non-tethered vesicles, docked vesicles, several sections through individual ribbons, and the segmentation of these structures, will provide greater confidence that the data reflect the images.

      Thank you very much for pointing this out. We have now included more details in supplementary figures and in the text.

      Specifically, we added:

      • More details about the morphological sub-pools (analysis and images):

        - We now show a sequence of images with different tethering states of membrane-proximal SVs, together with examples of docked and non-tethered SVs, for each condition, as we did in Chakrabarti et al., 2018 (Fig. 6-figure supplement 2, line 438). Moreover, to provide additional information for each condition, we selected one further tomogram per condition and depict two additional virtual sections (Fig. 6-figure supplement 2).

        - Moreover, we present a more detailed quantification of the different morphological sub-pools. For the MP-SV pool, we analyzed the SV diameters and the distances to the AZ membrane and PD separately for the different SV sub-pools; this information is now included in Fig. 7. For the RA-SVs, we additionally analyzed the morphological sub-pools and the SV diameters in the distal and proximal ribbon parts, as done in Chakrabarti et al. 2018. We have added a new supplementary figure (Fig. 7-figure supplement 2, line 558) and Supplementary file 2.

      • We replaced the virtual section in panel 6D. In the previous version, the ribbon appeared to be contacting the membrane, and we realized that this virtual section was not representative: the ribbon was in fact not directly contacting the AZ membrane, and a presynaptic density was still visible adjacent to the docked SVs. To avoid potential confusion, we selected a different virtual section of the same tomogram and now also mark the presynaptic density as a graphical aid in Fig. 6.

      The introduction raises questions about the length of membrane tethers in relation to vesicle movement toward the active zone, but this topic was not addressed in the manuscript.

      We apologize for not stating this clearly enough; we have now rephrased this sentence as follows:

      “…and seem to be organized in sub-pools based on the number of tethers and to which structure these tethers are connected. “

      Quantifying this metric, and the number of tethers especially for vesicles near the membrane, would seem straightforward. The topic of EPSC amplitude as representing unitary events due to variation in vesicle volume, size of the fusion pore, or vesicle-vesicle fusion was partially addressed. Membrane fusion events were not evident in the few images shown, but these presumably occurred and could be quantified. Likewise, sites of membrane retrieval could also be marked. These analyses will broaden the scope of the presentation, but also contribute to a more complete story.

      Regarding the presence/absence of membrane fusion events, we agree with the reviewer that this should be clearly addressed in the MS. We would like to point out the following:

      (i) We did not observe any omega shapes at the AZ membrane, which we also mention in the MS. Nor did we see them in data sets from previous publications (Vogl et al., 2015, JCS; Jung et al., 2015, PNAS).

      (ii) To clarify our observations on potential SV-SV fusion events, we now state in the Discussion (from line 688 onwards):

      “We could not observe any structures that might hint toward compound or cumulative fusion, either at the ribbon or at the AZ membrane, under our experimental conditions. Upon short and long stimulation, RA-SVs as well as docked SVs even showed a slightly reduced size compared to controls. However, since some AZs harbored more than one docked SV in stimulated conditions, we cannot fully exclude the possibility of coordinated release of a few SVs upon depolarization.”

      Furthermore, we agree with the reviewer that a complete presentation of endo-exocytosis structural correlates is very important. However, we focused our study on exocytosis events and therefore mainly analyzed membrane proximal SVs at active zones.

      Nonetheless, in response to the reviewer’s comment, we have now included a quantification of clathrin-coated (CC) structures. We determined the occurrence of CC vesicles (CCVs) and CC invaginations within 0-500 nm of the PD, and measured the diameter of the CCVs and their distances to the membrane and the PD. We found only very few CC structures in our tomograms (now added as a table in the Results section; Supplementary file 1). Sites of endocytic membrane retrieval are likely located in the peri-active zone area or even beyond. We did not observe obvious bulk endocytosis events connected to the AZ membrane. However, we do observe large endosome-like vesicles, which we did not quantify in this study. More details, albeit under different stimulation conditions, were presented in two of our previous studies (Kroll et al., 2019, 2020).

      Overall, the methodology forms the basis for future studies by this group and others to investigate rapid changes in synaptic vesicle distribution at this synapse.

      Reviewer #4 (Public Review):

      This manuscript investigates the process of neurotransmitter release from hair cell synapses using electron microscopy of tissue rapidly frozen after optogenetic stimulation. The primary finding is that in the absence of a stimulus very few vesicles appear docked at the membrane, but upon stimulation vesicles rapidly associate with the membrane. In contrast, the number of vesicles associated with the ribbon and within 50 nm of the membrane remains unchanged. Additionally, the authors find no changes in vesicle size that might be predicted if vesicles fuse to one-another prior to fusing with the membrane. The paper claims that these findings argue for rapid replenishment and against a mechanism of multi-vesicular release, but neither argument is that convincing. Nonetheless, the work is of high quality, the results are intriguing, and will be of interest to the field.

      We thank the reviewer for the appreciation of our work and the constructive comments.

      1) The abstract states that their results "argue against synchronized multiquantal release". While I might agree that the lack of larger structures is suggestive that homotypic fusion may not be common, this is far from an argument against any mechanisms of multi-quantal release. At least one definition of synchronized multiquantal release posits that multiple vesicles are fusing at the same time through some coordinated mechanism. Given that they do not report evidence of fusion itself, I fail to see how these results inform us one way or the other.

      We agree with the reviewer that the discussion on MVR and UVR should be extended. It is important to point out that we do not claim that evoked release is mediated by a single SV. As discussed in the paper (line 672), we consider that our optogenetic stimulation of IHCs triggers the release of more than 10 SVs per AZ. This is in line with previous reports of several SVs fusing upon stimulation. This type of evoked MVR is probably mediated by the opening of Ca2+ channels in close proximity to each SV's Ca2+ sensor. We indeed sometimes observed more than one docked SV per AZ upon long optogenetic stimulation, which could reflect this possibility. However, given the absence of large structures directly at the ribbon or the AZ membrane that could suggest compound fusion of several SVs prior to or during exocytosis, we argue against compound MVR at IHCs. As mentioned above, we have added the following to the Discussion (from line 679 onwards).

      We wrote:

      “This might reflect spontaneous univesicular release (UVR) via a dynamic fusion pore (i.e., ‘kiss and run’; Ceccarelli et al., 1979), which has previously been suggested for IHC ribbon synapses (Chapochnikov et al., 2014; Grabner and Moser, 2018; Huang and Moser, 2018; Takago et al., 2019), and/or rapid undocking of vesicles (e.g., Dinkelacker et al., 2000; He et al., 2017; Nagy et al., 2004; Smith et al., 1998). In the UVR framework, stimulation by ensuing Ca2+ influx triggers the statistically independent release of several SVs. Coordinated multivesicular release (MVR) has been suggested to occur at hair cell synapses (Glowatzki and Fuchs, 2002; Goutman and Glowatzki, 2007; Li et al., 2009) and retinal ribbon synapses (Hays et al., 2020; Mehta et al., 2013; Singer et al., 2004) during both spontaneous and evoked release. We could not observe any structures that might hint toward compound or cumulative fusion, either at the ribbon or at the AZ membrane, under our experimental conditions. Upon short and long stimulation, RA-SVs as well as docked SVs even showed a slightly reduced size compared to controls. However, since some AZs harbored more than one docked SV in stimulated conditions, we cannot fully exclude the possibility of coordinated release of a few SVs upon depolarization.”

      2) The complete lack of docked vesicles in the absence of a stimulus followed by their appearance with a stimulus is a fascinating result. However, since there are no docked vesicles prior to a stimulus, it is really unclear what these docked vesicles represent - clearly not the RRP. Are these vesicles that are fusing or recently fused or are they ones preparing to fuse? It is fine that it is unknown, but it complicates their interpretation that the vesicles are "rapidly replenished". How does one replenish a pool of docked vesicles that didn't exist prior to the stimulus?

      In response to the reviewer’s comment, we would like to note that we indeed reported very few docked SVs in wild-type IHCs under resting conditions without K+ channel blockers in Chakrabarti et al., 2018 (EMBO Rep) and in Kroll et al., 2020 (JCS). In both studies, a solution without TEA and Cs was used (resting solution in Chakrabarti et al.: 5 mM KCl, 136.5 mM NaCl, 1 mM MgCl2, 1.3 mM CaCl2, 10 mM HEPES, pH 7.2, 290 mOsmol; control solution in Kroll et al.: 5.36 mM KCl, 139.7 mM NaCl, 2 mM CaCl2, 1 mM MgCl2, 0.5 mM MgSO4, 10 mM HEPES, 3.4 mM L-glutamine, and 6.9 mM D-glucose, pH 7.4). Similarly, our current study shows very few docked SVs in the resting condition, even in the presence of TEA and Cs. Based on the results presented in ‘Response to reviewers Figure 1’, we assume that the scarcity of docked SVs under control conditions is not due to depolarization induced by a solution containing 20 mM TEA and 1 mM Cs, but rather represents the physiological resting state of IHC ribbon synapses. Upon 15 min of high-potassium depolarization, the number of docked SVs increased only slightly and not significantly (Chakrabarti et al., 2018; Kroll et al., 2020). In the current study, we report a similar phenomenon, but here depolarization resulted in a more robust increase in the number of docked SVs.

      To compare the data from the previous studies with the current study, we have now included an additional Table 3 in the Discussion (line 676) listing the total counts (and averages per AZ) of docked SVs.

    1. Author Response

      eLife assessment:

      This study addresses whether the composition of the microbiota influences the intestinal colonization of encapsulated vs unencapsulated Bacteroides thetaiotaomicron, a resident micro-organism of the colon. This is an important question because our limited understanding of the factors determining the colonization of gut bacteria remains a critical barrier in translating microbiome research into new bacterial cell-based therapies. To answer the question, the authors develop an innovative method to quantify B. theta population bottlenecks during intestinal colonization in the setting of different microbiota. Their main finding that the colonization defect of an acapsular mutant is dependent on the composition of the microbiota is valuable, and this observation suggests that interactions between gut bacteria explain why the mutant has a colonization defect. The evidence supporting this claim is currently insufficient. Additionally, some of the analyses and claims are compromised because the authors do not fully explain their data and the number of animals is sometimes very small.

      Thank you for this frank evaluation. Based on the Reviewers’ comments, the points raised have been addressed by improving the writing (apologies for insufficient clarity), and by the addition of data that to a large extent already existed or could be rapidly generated. In particular, the following data have been added:

      1. Increase to n>=7 for all fecal time-course experiments

      2. Microbiota composition analysis for all mouse lines used

      3. Data elucidating the mechanisms by which the SPF microbiome and host immunity restrict acapsular B. theta

      4. Short- versus long-term recolonization of germ-free mice with a complete SPF microbiota and assessment of the effect on B. theta colonization probability.

      5. Challenge of B. theta monocolonized mice with avirulent Salmonella to disentangle effects of the host inflammatory response from other potential explanations of the observations.

      6. Details of all inocula used

      7. Resequencing of all barcoded strains

      Additionally, we have improved the clarity of the text, particularly the Methods section describing the mathematical modeling in the main text. Major changes in the text, particularly those responding to reviewers' comments, have been highlighted here and in the manuscript.

      Reviewer #1 (Public Review):

      The study addresses an important question - how the composition of the microbiota influences the intestinal colonization of encapsulated vs unencapsulated B. theta, an important commensal organism. To answer the question, the authors develop a refurbished WITS with extended mathematical modeling to quantify B. theta population bottlenecks during intestinal colonization in the setting of different microbiota. Interestingly, they show that the colonization defect of an acapsular mutant is dependent on the composition of the microbiota, suggesting (but not proving) that interactions between gut bacteria, rather than with host immune mechanisms, explains why the mutant has a colonization defect. However, it is fairly difficult to evaluate some of the claims because experimental details are not easy to find and the number of animals is very small. Furthermore, some of the analyses and claims are compromised because the authors do not fully explain their data; for example, leaving out the zero values in Fig. 3 and not integrating the effect of bottlenecks into the resulting model, undermines the claim that the acapsular mutant has a longer in vivo lag phase.

      We thank the reviewer for taking the time to give this detailed critique of our work, and apologize that the experimental details were insufficiently explained. This criticism is well taken. Exact inoculum details for each experiment are now given in each figure (or in a supplement when multiple inocula are included). Exact microbiome composition analysis for the OligoMM12, LCM and SPF microbiota is now included in Figure 2 – figure supplement 1.

      Of course, the models could be expanded to include more factors, but we think this comment rather reflects our insufficiently clear explanation of the data. There are no “zero values missing” from Fig. 3. This is visible in the submitted raw data table (Excel file Source Data 1), but the points fully overlap in the graph and are therefore not easily distinguishable from one another. Time-points where no CFU were recovered were plotted at the detection limit (50 CFU/g) and are included in the curve fitting. However, on re-examination we noticed that the curve fit was carried out on the raw data rather than the log-transformed data, which resulted in over-weighting of the higher values. Re-fitting these data does not change the conclusions but provides a better fit. These experiments have now been repeated such that we have ≥7 animals in each group. The new data are presented in Fig. 3C and D and Figure 3 – figure supplement 2.
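      To illustrate why fitting on log-transformed counts matters, here is a minimal Python sketch; it is our illustration, not the authors' fitting code. It assumes exponential growth over the fitted window, so that log10 CFU is a straight line in time, and it floors zero counts at the 50 CFU/g detection limit stated above before taking logs. On the raw scale, a residual on a 10^6 CFU/g sample can be a thousand times larger than one on a 10^3 CFU/g sample, so the largest counts dominate a least-squares fit; on the log scale each time-point carries comparable weight. Function names and example values are hypothetical.

```python
import math

DETECTION_LIMIT = 50.0  # CFU/g; non-detects are plotted and fitted at this value


def log10_cfu(cfu_values, limit=DETECTION_LIMIT):
    """Floor each count (including zeros) at the detection limit, then take log10."""
    return [math.log10(max(v, limit)) for v in cfu_values]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical fecal counts (CFU/g) during exponential growth.
hours = [0, 4, 8, 12]
counts = [1e3, 1e4, 1e5, 1e6]
intercept, slope = fit_line(hours, log10_cfu(counts))
# slope is in log10 units per hour; doubling time = log10(2) / slope
```

      A real growth curve with a lag phase and a plateau needs a nonlinear model, but the same rule applies: fit the model to log10 CFU rather than to raw CFU.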

      Limitations:

      1) The experiments do not allow clear separation of effects derived from the microbiota composition and those that occur secondary to host development without a microbiota or with a different microbiota. Furthermore, the measured bottlenecks are very similar in LCM and Oligo mice, even though these microbiotas differ in complexity. Oligo-MM12 was originally developed and described to confer resistance to Salmonella colonization, suggesting that it should tighten the bottleneck. Overall, an add-back experiment demonstrating that conventionalizing germ-free mice imparts a similar bottleneck to SPF would strengthen the conclusions.

      These are excellent suggestions and have been followed. Additional data are now presented in Figure 2 – figure supplement 8 showing short- versus long-term recolonization of germ-free mice with an SPF microbiota, recovering values of beta very similar to those of our standard SPF mouse colony. These data demonstrate a larger total niche size for B. theta at 2 days post-colonization, which normalizes by 2 weeks post-colonization. Independently of this, the colonization probability is already equivalent to that observed in our SPF colony at day 2 post-colonization. Therefore, the mechanisms causing early clonal loss are established very rapidly when a germ-free mouse is colonized with an SPF microbiota. We have additionally demonstrated that SPF mice do not have detectable intestinal antibody titers specific for acapsular B. theta (Figure 2 – figure supplement 7), so this is unlikely to explain why acapsular B. theta struggles to colonize at all in the context of an SPF microbiota. Experiments were also carried out to detect bacteriophage capable of lysing B. theta and acapsular B. theta in SPF mouse cecal content (Figure 2 – figure supplement 7). No lytic phage plaques were observed. However, plaque assays are not sensitive for the detection of weakly lytic phage, or of phage that may require expression of surface structures that are not induced in vitro. We can therefore conclude that the restrictive activity of the SPF microbiota a) is reconstituted very fast in germ-free mice, b) is very likely not related to the activity of intestinal IgA and c) cannot be attributed to a high abundance of strongly lytic bacteriophage. The simplest explanation is that a large fraction of the restriction is due to metabolic competition with a complex microbiota, but we cannot formally exclude other factors such as antimicrobial peptides or changes in intestinal physiology.

      2) It is often difficult to evaluate results because important parameters are not always given. Dose is a critical variable in bottleneck experiments, but it is not clear whether the total dose or just the WITS dose changes in Figure 2. Total dose as well as n0 should be depicted in all figures.

      We apologize for the lack of clarity in the figures. We have added panels depicting the exact inoculum to each figure (or a supplementary figure where many inocula were used). Additionally, the methods section describing how barcoded CFU were calculated has been rewritten and is hopefully now clearer.

      3) This is in part a methods paper but the method is not described clearly in the results, with important bits only found in a very difficult supplement. Is there a difference between colonization probability (beta) and inoculum size at which tags start to disappear? Can there be some culture-based validation of "colonization probability" as explained in the mathematics? Can the authors contrast the advantages/disadvantages of this system with other methods (e.g. sequencing-based approaches)? It seems like the numerator in the colonization probability equation has a very limited range (from 0.18-1.8), potentially limiting the sensitivity of this approach.

      We apologize for the lack of clarity in the methods. This criticism is well taken, and we have rewritten large sections of the methods in the main text to include all relevant details previously buried in the extensive supplement.

      On the question of the colonization probability and the inoculum size, we kept the inoculum size at 10^7 CFU/mouse in all experiments (except those in Fig. 4, where this is explicitly stated), only changing the fraction of spiked-in barcoded strains. We verified the accuracy of our barcode recovery rate by serial dilution over 5 logs (new figure added: Figure 1 – figure supplement 1). “The CFU of barcoded strains in the inoculum at which tags start to disappear” is by definition closely related to the colonization probability, as this value (n0) appears in the calculation. Note that this is not the total inoculum size, which (unless otherwise stated in Fig. 4) is kept constant at 10^7 CFU by diluting the barcoded B. theta with untagged B. theta. Again, this is now better explained in all figure legends and the main text.
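      One way to see why n0 enters the calculation is a minimal sketch under the standard neutral-tagging assumption, which is an assumption of this illustration rather than a restatement of the authors' supplement: each of the n0 inoculated cells carrying a given tag independently founds a lineage with probability beta, so the number of founders per tag is Poisson-distributed with mean beta·n0 and a tag is lost with probability exp(-beta·n0). Inverting this gives beta = -ln(fraction of tags lost)/n0. The function name and tag counts below are hypothetical.

```python
import math


def colonization_probability(tags_lost, tags_total, n0):
    """Estimate beta from pooled tag loss.

    Assumes each of the n0 inoculated cells per tag independently founds a
    lineage with probability beta, so founders per tag ~ Poisson(beta * n0)
    and P(tag lost) = exp(-beta * n0).
    """
    if not 0 < tags_lost < tags_total:
        raise ValueError("need at least one lost and one recovered tag")
    fraction_lost = tags_lost / tags_total
    return -math.log(fraction_lost) / n0


# Hypothetical experiment: 10 mice x 7 tags, 35 tags lost, ~10 CFU per tag.
beta = colonization_probability(tags_lost=35, tags_total=70, n0=10)
```

      Under this model, losing half of the tags at 10 CFU per tag corresponds to beta = ln(2)/10, roughly 0.07; the estimate is undefined when no tags, or all tags, are lost.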

      We have added an experiment using peak-to-trough ratios (PTR) in metagenomic sequencing to estimate the B. theta growth rate. This could be usefully employed for wildtype B. theta at a relatively early timepoint post-colonization, when growth was rapid. However, this metagenomics-based technique requires the examined strain to be present at an abundance of over 0.1-1% for accurate quantification, so we could not analyze the acapsular B. theta strain in cecum content at the same timepoint. These data have been added (Figure 3 – figure supplement 3). Note that the information gleaned from these techniques is different: PTR reveals relative growth rates at a specific time (if the strain is sufficiently abundant), whereas neutral tagging reveals average population values over quite large time windows. We believe that both approaches are valuable. A few sentences comparing the approaches have been added to the discussion.
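      The peak-to-trough idea can be sketched as follows. In an actively replicating population, loci near the replication origin are present in more copies than loci near the terminus, so read coverage along the circular chromosome peaks near the origin and dips near the terminus; the ratio of the two tracks the replication rate. This is a deliberately simplified Python illustration with synthetic coverage and hypothetical function names. Published PTR methods map reads to reference genomes and filter and fit the coverage profile more carefully, and the authors' pipeline is not described here.

```python
def circular_moving_average(coverage, window):
    """Smooth per-bin coverage with a centered moving average that wraps around,
    because bacterial chromosomes are circular. `window` should be odd."""
    n = len(coverage)
    half = window // 2
    return [
        sum(coverage[(i + k) % n] for k in range(-half, half + 1)) / window
        for i in range(n)
    ]


def peak_to_trough_ratio(coverage, window=1):
    """Ratio of the highest to the lowest smoothed coverage value."""
    smoothed = circular_moving_average(coverage, window)
    return max(smoothed) / min(smoothed)


# Synthetic coverage: 100 bins, origin at bin 0, terminus at bin 50.
# Coverage halves from origin to terminus, i.e. a true ratio of 2.
N_BINS = 100
synthetic = [
    2 ** (1 - min(i, N_BINS - i) / (N_BINS / 2)) for i in range(N_BINS)
]
ptr_raw = peak_to_trough_ratio(synthetic)
ptr_smoothed = peak_to_trough_ratio(synthetic, window=5)
```

      Smoothing suppresses noise in real data but slightly flattens the peak and fills the trough, so the smoothed ratio here comes out a little below the true value of 2. This is also why the approach needs deep coverage, and hence a reasonably abundant strain.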

      The actual numerator is based on the fraction of lost tags, which is the total number of tags lost across the experiment (summed over all mice) divided by the total number of tags used (number of mice times the number of tags per mouse). Very low tag recovery (less than one tag per mouse) strays into very noisy data, while close to zero loss is also associated with a low information-to-noise ratio. Therefore, the size of this numerator is necessarily constrained by our setting up the experiments to obtain close-to-optimal information recovery from the WITS abundance. Robustness of these analyses is provided by the high “n” of between 10 and 17 mice per group.
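      The trade-off described above can be made concrete with a back-of-the-envelope calculation. This is our illustration rather than the authors' analysis. Assume the numerator is -ln(p), where p is the pooled fraction of lost tags, and that the N tags are lost independently (binomial sampling). By the delta method, SE(-ln p̂) ≈ sqrt((1 - p)/(N·p)), so the relative standard error of beta (n0 cancels) is large both when almost no tags are lost (p near 1, numerator near 0) and when almost all tags are lost (p near 0). A grid search places the minimum near p ≈ 0.2, i.e. a numerator of about 1.6, inside the 0.18-1.8 range noted by the reviewer. The tag total of 70 (e.g. 10 mice × 7 tags) is hypothetical.

```python
import math


def relative_se_of_beta(p_lost, n_tags):
    """Delta-method relative standard error of beta = -ln(p_lost) / n0.

    Var(p_hat) = p(1 - p) / N and d(-ln p)/dp = -1/p, so
    SE(-ln p_hat) ~ sqrt((1 - p) / (N * p)); n0 cancels in the ratio.
    """
    se = math.sqrt((1 - p_lost) / (n_tags * p_lost))
    return se / -math.log(p_lost)


N_TAGS = 70  # hypothetical: 10 mice x 7 tags
grid = [i / 1000 for i in range(1, 1000)]
best_p = min(grid, key=lambda p: relative_se_of_beta(p, N_TAGS))
best_numerator = -math.log(best_p)
```

      The location of the optimum does not depend on N; a larger number of tags only scales the relative error down by 1/sqrt(N), which is why pooling over many mice matters.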

      4) Figure 3 and the associated model is confusing and does not support the idea that a longer lag-phase contributes to the fitness defect of acapsular B.theta in competitive colonization. Figure 3B clearly indicates that in competition acapsular B. theta experiences a restrictive bottleneck, i.e., in competition, less of the initial B. theta population is contributed by the acapsular inoculum. There is no need to appeal to lag-phase defects to explain the role of the capsule in vivo. The model in Figure 3D should depict the acapsular population with less cells after the bottleneck. In fact, the data in Figure 3E-F can be explained by the tighter bottleneck experienced by the acapsular mutant resulting in a smaller acapsular founding population. This idea can be seen in the data: the acapsular mutant shedding actually dips in the first 12-hours. This cannot be discerned in Figure 3E because mice with zero shedding were excluded from the analysis, leaving the data (and conclusion) of this experiment to be extrapolated from a single mouse.

      We of course completely agree that this would be a correct conclusion if only the competitive colonization data is taken into account. However, we are also trying to understand the mechanisms at play generating this bottleneck and have investigated a range of hypotheses to explain the results, taking into account all of our data.

      Hypothesis 1) Competition is due to increased killing prior to reaching the cecum and commencing growth. Note that the probability of colonization for single B. theta clones is very similar for the wildtype and acapsular strains on single colonization of OligoMM12 mice. For this hypothesis to explain the outcompetition of the acapsular strain, the presence of wildtype would have to increase the killing of acapsular B. theta in the stomach or small intestine. The bacteria are at low density at this stage, and stomach acid/small intestinal secretions should be similar in all animals. Therefore, this explanation seems highly unlikely.

      Hypothesis 2) Competition between wildtype and acapsular B. theta occurs at the point of niche competition before growth commences in the cecum (similar to the reviewer's proposal). It is possible that the wildtype strain has a competitive advantage in colonizing physical niches (for example, proximity to bacteria producing colicins). On the basis of the data, we cannot exclude this hypothesis completely, and it is challenging to measure directly. However, in our in vivo growth-curve data we observe a similar delay in the arrival of acapsular B. theta CFU in the feces on single colonization as in competition, suggesting that the presence of wildtype (i.e., initial niche competition) is not the cause of this delay. Rather, it is an intrinsic property of the acapsular strain in vivo.

      Hypothesis 3) Competition between wildtype and acapsular B. theta is mainly attributable to differences in growth kinetics in the gut lumen. To investigate growth kinetics, we carried out time-courses of fecal collection from OligoMM12 mice single-colonized with wildtype or acapsular B. theta, i.e., in a situation where we observe identical colonization probabilities for the two strains. These data, now shown in Figure 3C and D and Figure 3 – figure supplement 2, show that even without competition, acapsular B. theta CFU appear later and with a lower net growth rate than the wildtype. As these single colonizations show no measurable difference between the colonization probabilities of the two strains, it is unlikely that the delayed appearance of acapsular B. theta in feces is due to increased killing (this would be clearly visible as barcode loss in the single colonizations). Rather, the simplest explanation for this observation is a bona fide lag phase before growth commences in the cecum. Interestingly, using only the lower net growth rate (assumed to reflect a similar growth rate but an increased clearance rate) produces a good fit for our data on both competitive index and colonization probability in competition (Figure 3 – figure supplement 5). This fit is slightly improved by adding in the observed lag phase (Figure 3). It is very difficult to manipulate the lag phase experimentally in order to test directly how much it contributes, and its contribution is therefore carefully described in the new text.
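      To see how a lag phase and a lower net growth rate each shift the competitive index, here is a minimal deterministic sketch. It is not the authors' model, which also includes the stochastic bottleneck. It assumes equal inocula, a fixed lag followed by exponential growth at a constant net rate (growth minus clearance), and no carrying capacity, so it only applies before the cecal population saturates. All parameter values and function names are hypothetical.

```python
import math


def log_population(t, n0, net_rate, lag):
    """Natural-log population size: constant during the lag, then exponential.

    net_rate is growth minus clearance (per hour); no carrying capacity.
    """
    return math.log(n0) + net_rate * max(0.0, t - lag)


def competitive_index(t, wildtype, acapsular):
    """Wildtype/acapsular ratio at time t for (n0, net_rate, lag) tuples.

    With equal inocula this equals the usual output/input competitive index.
    """
    return math.exp(log_population(t, *wildtype) - log_population(t, *acapsular))


# Hypothetical scenarios at t = 10 h, equal inocula of 100 cells:
ci_lag_only = competitive_index(10, (100, 0.5, 2), (100, 0.5, 4))   # 2 h longer lag
ci_rate_only = competitive_index(10, (100, 0.5, 2), (100, 0.4, 2))  # lower net rate
```

      In this toy model a longer lag produces a fixed fold-advantage (exp(net_rate × extra lag)) that stops growing once both strains are dividing, whereas a lower net rate produces an advantage that keeps increasing with time. This is one reason the growth-rate difference alone can already account for much of the competitive defect.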

      Please note that all data were plotted and used in fitting in Fig. 3E, but “zero-shedding” values are plotted at the detection limit and overlaid, making it look like only one point was present when in fact several were used. This was clear in the submitted raw data tables. To shore up these observations, we have repeated all time-courses and now have n≥7 mice per group.

      5) The conclusions from Figure 4 rely on assumptions not well-supported by the data. In the high fat diet experiment, a lower dose of WITS is required to conclude that the diet has no effect. Furthermore, the authors conclude that Salmonella restricts the B. theta population by causing inflammation, but do not demonstrate inflammation at their timepoint or disprove that the Salmonella population could cause the same effect in the absence of inflammation (through non-inflammatory direct or indirect interactions).

      We of course agree that we would expect to see some loss of B. theta in HFD. However, for these experiments the inoculum was a dose of ~10^9 CFU/100 μL of the untagged strain spiked with approximately 30 CFU of each tagged strain. Decreasing the number of each WITS below 30 CFU leads to very high variation in the starting inocula from mouse to mouse, which massively complicates the analysis. To clarify this point, we have added a detection-limit calculation showing that the neutral tagging technique is not very sensitive to population contractions of less than 10-fold, which is likely in line with what would be expected for high-fat diet feeding of monocolonized mice over a short time-span.
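      The statement that fewer than ~30 CFU per tag leads to highly variable inocula follows from counting statistics. Assume, for illustration, that each tag's CFU in a gavage aliquot drawn from a well-mixed suspension is Poisson-distributed. The coefficient of variation is then 1/sqrt(mean), and the chance that a tag is missing from the inoculum altogether is exp(-mean). A minimal Python sketch with hypothetical function names:

```python
import math


def poisson_cv(mean_cfu):
    """Coefficient of variation of a Poisson count: sd / mean = 1 / sqrt(mean)."""
    return 1 / math.sqrt(mean_cfu)


def prob_tag_absent(mean_cfu):
    """Probability that a tag is absent from one inoculum: P(X = 0) = exp(-mean)."""
    return math.exp(-mean_cfu)


for mean_cfu in (30, 10, 3):
    print(mean_cfu, round(poisson_cv(mean_cfu), 3), prob_tag_absent(mean_cfu))
```

      At 30 CFU per tag, the mouse-to-mouse coefficient of variation is about 18% and a tag is essentially never absent. At 3 CFU it is about 58%, and roughly 5% of tags would be missing from the inoculum by chance, which would be indistinguishable from in vivo tag loss.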

      This is a very good observation regarding our Salmonella infection data. We have now added the fecal lipocalin 2 values, as well as a group infected with a ssaV/invG double mutant of S. Typhimurium that does not cause clinical grade inflammation (“avirulent”). This shows 1) that the attenuated S. Typhimurium is causing intestinal inflammation in B. theta colonized mice and 2) that a major fraction of the population bottleneck can be attributed to inflammation. Interestingly, we do observe a slight bottleneck in the group infected with avirulent Salmonella which could be attributable either to direct toxicity/competition of Salmonella with B. theta or to mildly increased intestinal inflammation caused by this strain. As we cannot distinguish these effects, this is carefully discussed in the manuscript.

      6) Several of the experiments rely on very few mice/groups.

      We have increased the n to over 5 per group in all experiments (most critically those shown in Fig 3, Supplement 5). See figure legends for specific number of mice per experiment.

      Reviewer #2 (Public Review):

      The goal of this study was to understand population bottlenecks during colonization in the context of different microbial communities. Capsular polysaccharide mutants, diet, and enteric infection were also used, paired with short-term monitoring of overall colonization and the levels of specific strains. The major strength of this study is the innovative approach and the significance of the overall research area.

      The first major limitation is the lack of clear and novel insight into the biology of B. theta or other gut bacterial species. The title is provocative, but the experiments as is do not definitively show that the microbiota controls the relative fitness of acapsular and wild-type strains or provide any mechanistic insights into why that would be the case. The data on diet and infection seem preliminary. Furthermore, many of the experiments conflict with prior literature (i.e., lack of fitness difference between acapsular and wild-type strain and lack of impact of diet) but satisfying explanations are not provided for the lack of reproducibility.

      In line with suggestions from Reviewer 1, the paper has undergone quite extensive rewriting to better explain the data presented and their consequences. We now also explicitly comment on apparent discrepancies between our reported data and the literature. For example, the colonization defect of acapsular B. theta has only been published for competitive colonizations, where we also observe a fitness defect, so there is no actual conflict. Furthermore, we have calculated detection limits for the effect of high-fat diet and demonstrated that a 10-fold reduction in the effective population size would not be robustly detected with the neutral tagging technique, such that we are probably just underpowered to detect small effects; we believe it is important to point out the numerical limits of the technique we present here. Finally, for the Figure 4 experiments, we have added data on colonization/competition with an avirulent Salmonella challenge, giving some mechanistic data on the role of inflammation in the B. theta bottleneck.

      Another major limitation is the lack of data on the various background gut microbiotas used. eLife is a journal for a broad readership. As such, describing what microbes are in LCM, OligoMM, or SPF groups is important. The authors seem to assume that the gut microbiota will reflect prior studies without measuring it themselves.

      All gnotobiotic lines are bred as gnotobiotic colonies in our isolator facility. This is now better explained in the methods section. Additionally, 16S sequencing of all microbiotas used in the paper has been added as Figure 2 – figure supplement 1.

      I also did not follow the logic of concluding that any differences between SPF and the two other groups are due to microbial diversity, which is presumably just one of many differences. For example, the authors acknowledge that host immunity may be distinct. It is essential to profile the gut microbiota by 16S rRNA amplicon sequencing in all these experiments and to design experiments that more explicitly test the diversity hypotheses vs. alternatives like differences in the membership of each community or other host phenotypes.

      This is an important point. We have carried out a number of experiments to potentially address some issues here.

      1) We carried out B. theta colonization experiments in germ-free mice that had been colonized by gavage of SPF feces either 1 day or 2 weeks prior to B. theta colonization. While the shorter pre-colonization allowed B. theta to colonize to a higher population density in the cecum, the colonization probability was already reduced to the levels observed in our SPF colony. Therefore, the factors limiting B. theta establishment in the cecum are already established 1-2 days post-colonization with an SPF microbiota (Figure 2 – figure supplement 8).

      2) We checked for the presence of secretory IgA capable of binding to the surface of live B. theta, compared to a positive control of a mouse orally vaccinated against B. theta (Figure 2 – figure supplement 7), and found no evidence of specific IgA targeting B. theta in the intestinal lavages of our SPF mouse colony.

      3) We isolated bacteriophage from the intestine of SPF mice and used them to infect lawns of wildtype and acapsular B. theta in vitro. We could not detect any plaque-forming phage coming from the intestine of SPF mice (Figure 2 – figure supplement 7).

      We can therefore exclude strongly lytic phage and host IgA as dominant driving mechanisms restricting B. theta colonization. It remains possible that rapidly upregulated host factors such as antimicrobial peptide secretion could play a role, but metabolic competition from the microbiota is also a very strong candidate hypothesis. The text regarding these experiments has been slightly rewritten to point out that colonization probability inversely correlates with microbiota complexity, and the mechanisms involved may involve both direct microbe-microbe interactions as well as host factors.

      Given the prior work on the importance of capsule for phage, I was surprised that no efforts are taken to monitor phage levels in these experiments. Could B. theta phage be present in SPF mice, explaining the results? Alternatively, is the mucus layer distinct? Both could be readily monitored using established molecular/imaging methods.

      See above: no plaque-forming phage could be recovered from the SPF mouse cecum content. The main replicative site that we have studied here, in mice, is the cecum, which does not have true mucus layers in the same way as the distal colon and is upstream of the colon, so it is unlikely to be affected by colon geography. Rather, mucus is well mixed with the cecum content and may behave as a dispersed nutrient source. There is certainly a higher availability of mucus in the gnotobiotic mice due to less competition for mucus degradation by other strains. However, this would be challenging to link directly to the B. theta colonization phenotype, as Muc2-deficient mice develop intestinal inflammation.

      The conclusion that the acapsular strain loses out due to a difference of lag phase seems highly speculative. More work would be needed to ensure that there is no difference in the initial bottleneck; for example, by monitoring the level of this strain in the proximal gut immediately after oral gavage.

      This is an excellent suggestion and has been carried out. At 8 h post-colonization with a high inoculum (allowing easy detection), there were identical low levels of B. theta in the upper and lower small intestine, but more wildtype than acapsular B. theta in the cecum and colon, consistent with the commencement of growth for wildtype B. theta but not for the acapsular strain at this timepoint. We have additionally repeated the single-colonization time-courses using our standard inoculum and can clearly see the delayed detection of acapsular B. theta in feces even in the single-colonization state, when no increased bottleneck is observed. This can only reasonably be explained by a bona fide lag-phase extension for acapsular B. theta in vivo. These data also reveal a decreased net growth rate of acapsular B. theta. Interestingly, our model can be fitted quite well to the data obtained both for competitive index and for colonization probability using only the difference in net growth rate. Adding the (clearly observed) extended lag phase generates a model that is still consistent with our observations.

      Another major limitation of this paper is the reliance on short timepoints (2-3 days post colonization). Data for B. theta levels over 2 weeks or longer is essential to put these values in context. For example, I was surprised that B. theta could invade the gut microbiota of SPF mice at all and wonder if the early time points reflect transient colonization.

      It should be noted that “SPF” defines a microbiota only by the absence of specified pathogens and not by its absolute composition. Therefore, the rather efficient B. theta colonization in our SPF colony is likely due to a permissive composition and is likely not to be reproducible between different SPF colonies (a major confounder in the reproducibility of mouse experiments between institutions; in contrast, the gnotobiotic colonies are highly reproducible). We consistently see colonization of our SPF colony by wildtype B. theta out to at least 10 days post-inoculation (the latest time-point tested) at loads similar to those observed in this work, indicating that this is not just transient “flow-through” colonization. Data included below:

      For this paper we were very specifically quantifying the early stages of colonization, also because the longer we run the experiments for, the more confounding features of our “neutrality” assumptions appear (e.g., host immunity selecting for evolved/phase-varied clones, within-host evolution of individual clones etc.). For this reason, we have used timepoints of a maximum of 2-3 days.

      Finally, the number of mice/group is very low, especially given the novelty of these types of studies and uncertainty about reproducibility. Key experiments should be replicated at least once, ideally with more than n=3/group.

      For all barcode quantification experiments we have between 10 and 17 mice per group. Experiments for the in vivo time-courses of colonization have been expanded to an “n” of at least 7 per group.

    1. Author Response:

      Reviewer #1 (Public Review):

      The authors report the generation of a mesoscale excitatory projectome from the ventrolateral prefrontal cortex (vlPFC) in the macaque brain by using AAV2/9-CaMKIIa-Tau-GFP labeling and imaging with high-throughput serial two-photon tomography. They present a novel data pipeline that integrates the STP data with macroscopic dMRI data from the same brain in a common 3D space, achieving a direct comparison of the two tracing methods. The analysis of the data revealed an interesting discrepancy between the high resolution STP data and the lower resolution dMRI data with respect to the extent of the frontal lobe projection through the inferior fronto-occipital fasciculus (IFOF) - the longest associative axon bundle in the human brain.

      The authors report the generation of a mesoscale excitatory projectome from the ventrolateral prefrontal cortex (vlPFC) in the macaque brain by using AAV2/9-CaMKIIa-Tau-GFP labeling and imaging with high-throughput serial two-photon tomography. They also present a novel data pipeline that integrates the STP data with macroscopic dMRI data from the same brain in a common 3D space, achieving a direct comparison of the two tracing methods. Overall the paper can serve as a how to example for analyzing large non-human primate brain data, though some parts of the paper can be improved and the interpretation of the data should also be further strengthened.

      We thank the reviewer for his positive evaluation of our manuscript.

      The methodological part should include more detail on image acquisition - speed of imaging, pixel residence time, total time for data acquisition of a single brain and data sizes. Also the time and hardware needed for the computational analysis should be included, including the registration to the common reference and the running time for the machine learning predictions - this should also include the F score for the axon detection.

      We thank the reviewer for pointing out these vital issues. We have added these technical details in the resubmitted manuscript.

      “High x-y resolution (0.95 μm/pixel) serial 2D images were acquired in the coronal plane at a z-interval of 200 μm across the entire macaque brain. The scanning time of a single field of view containing 1024 by 1024 pixels was 1.629 s (i.e., the pixel residence time was ~1.6 μs), which resulted in ~1 month of continuous scanning and ~5 TB of STP tomography data for a single monkey brain.”
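      The pixel residence time quoted above follows directly from the field-of-view scan time. As a quick check, the sketch below divides one by the other; it ignores line flyback and other scanner overheads (an assumption of this illustration), which would make the true dwell time somewhat shorter.

```python
FOV_SCAN_TIME_S = 1.629   # scan time of one field of view, from the text
FOV_PIXELS = 1024 * 1024  # pixels per field of view, from the text


def pixel_dwell_time_us(scan_time_s, n_pixels):
    """Average time per pixel in microseconds, ignoring flyback and other overheads."""
    return scan_time_s / n_pixels * 1e6


dwell_us = pixel_dwell_time_us(FOV_SCAN_TIME_S, FOV_PIXELS)
# dwell_us is about 1.55, consistent with the "~1.6 μs" quoted above
```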

      “The data analysis was undertaken on a compute cluster with a 3.1-3.3 GHz 248-core CPU, 2.8 TB of RAM, and 17472 CUDA cores.”

      “The total computational time for the machine learning predictions in one macaque brain was ~ 1.5 months.”

      “To evaluate overall classifier performance, the precision–recall F measure, also called the F-score, was computed using four additional labeled images as test sets. Higher classifier accuracy yields higher F-scores; the classifier achieved an F-score of 94.41% ± 1.99% (mean ± S.E.M.).”
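      For reference, the balanced F measure (F1, assumed here) is the harmonic mean of precision and recall, and the reported summary is a mean ± standard error of the mean across the test images. Below is a minimal Python sketch. The text does not state whether true and false positives were counted per pixel or per detected axon segment, so the counts and function names are hypothetical.

```python
import math


def f_score(tp, fp, fn):
    """Balanced F measure (F1): harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_sem(values):
    """Mean and standard error of the mean (sample SD / sqrt(n))."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    return mean, math.sqrt(variance / n)


# Hypothetical counts for one test image: precision = recall = 0.9.
example_f = f_score(tp=90, fp=10, fn=10)
```

      Applying `f_score` to each of the four test images and passing the results to `mean_sem` gives a summary in the same form as the one quoted above.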

      “Registration to the 3D common space took approximately half an hour.”

      The discrepancy between the high resolution STP data and the lower resolution dMRI data with respect to the extent of the frontal lobe projection through the inferior fronto-occipital fasciculus seems puzzling. One would expect that the STP data would reveal more detail, not less. One possibility is that the Tau-GFP does not diffuse throughout the full axon arborization of the PFC neurons, resulting in a technical artifact. Can this be excluded to support the functional significance of the current data?

      We thank the reviewer for raising this important issue. We apologize for not providing sufficient details of the IFOF debate due to limited space, which caused confusion. We have added literature background on the IFOF debate to the Introduction (as also recommended by Reviewer #2). As pointed out in the comments by Reviewer #2, the present finding provides direct support for the hypothesis that the IFOF of macaque monkeys may not exist as a monosynaptic pathway.

      The AAV construct encoding cytoskeletal GFP (Tau-GFP) was used here to label all processes of the infected neuron, including axons and synaptic terminals. About 3 weeks of post-surgery survival time are usually sufficient to label intracerebral circuits in rodents (Lanciego and Wouterlood, 2020). We have extended the survival time to 2-3 months in order to achieve adequate labeling of axonal fibers and terminals in macaques.

      Regarding the extent of Tau-GFP diffusion, the STP images and high-resolution confocal microscopic analysis further showed differences in morphology between the axon fibers along their route and at their terminals. Consistent with previous reports (Fuentes-Santamaria et al., 2009; Watakabe and Hirokawa, 2018), the axon fibers were thin and formed bouton-like varicosities in the terminal regions (MD, Figure 2—figure supplement 7D; caudate, Figure 2—figure supplement 7J; PFC, Figure 1—figure supplement 5A-D). These results indicate that the Tau-GFP has reached the axonal terminals.

      References:

      Fuentes-Santamaria V, Alvarado JC, McHaffie JG, Stein BE (2009) Axon Morphologies and Convergence Patterns of Projections from Different Sensory-Specific Cortices of the Anterior Ectosylvian Sulcus onto Multisensory Neurons in the Cat Superior Colliculus. Cereb Cortex 19:2902-2915.

      Lanciego JL, Wouterlood FG (2020) Neuroanatomical tract-tracing techniques that did go viral. Brain Struct Funct 225:1193-1224.

      Watakabe A, Hirokawa J (2018) Cortical networks of the mouse brain elaborate within the gray matter. Brain Struct Funct 223:3633-3652.

      Reviewer #2 (Public Review):

      The authors utilized viral vectors as neural tracers to delineate the connectivity map of the macaque vlPFC at the axonal level. There are three main goals of this study: 1) determine an effective viral vector for tract-tracing in the macaque brain, 2) delineate the detailed map of excitatory vlPFC projections to the rest of the brain, and 3) compare vlPFC connectivity between tracing and tractography results.

      We thank the reviewer for his/her constructive comments, to which we respond below.

      Accordingly, my comments are organized around each aim:

      1) This study demonstrates the advantage of viral tracing technique in targeting neuron type-specific pathways. The authors conducted injection experiments with three types of viral vectors and found success of AAV in labeling long-distance connections without causing fatal neurotoxicity in the monkey. This success extends the application of AAV from rodents to nonhuman primates. The fact that AAV specifically targets glutamatergic neurons makes it advantageous for mapping excitatory projections.

      Although the labeling efficacy of each viral vector type is described in the text, Fig. 2 does not present a clear comparison across viral vectors, despite such comparison for a thalamic injection in Fig. 2S. Without a comparable graph to Fig. 2E, it is unclear to what extent the VSV and lentivirus failed in labeling long-distance pathways.

      We thank the reviewer for the helpful suggestion. As suggested, we have added three new figures as Supplementary materials in the revised manuscript.

      Figure 2—figure supplement 2. Expression of GFP using VSV-ΔG injected into MD thalamus of the macaque brain. (A) GFP-labeled neurons were found in the MD thalamus ~5 days after injection of VSV-ΔG encoding Tau-GFP. (B) A magnified view illustrating the morphology of GFP-labeled neurons in the area outlined with a white box in (A). (C) Higher-magnification view of GFP-positive axons.

      Figure 2—figure supplement 3. Expression of GFP using lentivirus injected into MD thalamus of the macaque brain. (A) A lentiviral construct was injected into the macaque thalamus, and transgene expression was examined after ~9 months. (B) High-power view of the dotted rectangle in panel A. (C) Magnified view of panel B. Note the presence of GFP-positive cells.

      Figure 2—figure supplement 4. Expression of GFP using AAV2/9 injected into MD thalamus of the macaque brain. (A) GFP-labeled axons were observed in the subcortical regions ~42 days after injection of AAV2/9 encoding Tau-GFP in MD thalamus. The inset shows the injection site in MD thalamus. Two dashed boxes enclose the regions of interest, the frontal white matter and the ALIC, whose GFP signals are magnified in (B) and (C), respectively. (D) Higher-magnification view of GFP-positive axons.

      2) The authors quantified connectivity strength by the GFP signal intensity using a machine-learning algorithm. Both the quantitative approach and the resulting excitatory projection map are important contributions to advancing our knowledge of vlPFC connectivity.

      However, several issues with the analysis raise concerns about the connectivity results. First, the strength measure is based on axonal patterns in the terminal fields (which the authors refer to as "axon clusters"), detected by a machine-learning algorithm (page 25, lines 11-13). However, the actual synaptic connections are the small dot-looking signals in the background. These "green dots" are boutons on the dendritic trees. The density of boutons, rather than of passing fibers, reflects the density of synapses. The brief method description does not mention how the boutons were quantified, and it is unclear whether this signal was treated as background noise and filtered out. Second, it is difficult for the reader to assess the robustness of the vlPFC connectivity patterns, for the following reasons: i) It is unclear how many injection cases were used to generate the result reported in the subsection "Brain-wide excitatory projectome of vlPFC in macaques". The text mentions a singular "injection site" (page 8, line 12) and Fig. 4 shows a single site. However, there are three cases listed in Table 1. Is the result an average of all three cases? ii) Relatedly, it is unclear in which anatomical area the injection was placed in each case. Table 1 lists the site as "vlPFC" for all three cases, while the vlPFC contains areas 44, 45 and 12l. These areas have different projection patterns documented in the tract tracing literature. If different areas were injected in the three cases, they should be reported separately. iii) It is hard to compare the projection patterns with those reported in the literature. Conventionally, tract tracing studies report terminal fields by showing original labeling patterns in both cortical and subcortical regions without averaging within divided areas (see e.g. Petrides & Pandya, 2007, J Neurosci). It is therefore hard to compare Fig. 3 with previous tract tracing studies to assess its robustness.

      We thank the reviewer for his/her constructive comments, to which we respond below.

      1) We appreciate the reviewer’s comment and sincerely apologize for not explaining this point clearly in our previous submission. The main concern is whether axonal varicosities might have been treated as background noise and removed by mistake. In fact, the machine-learning segmentation removed the dot-looking autofluorescence, not the axonal varicosities. We have therefore provided new results and updated the “Materials and Methods” and “Discussion” sections of the revision accordingly.

      “Fluorescent images of the primate brain (Abe et al., 2017) often contain a high-intensity, dot-looking background signal caused by the accumulation of lipofuscin. Because lipofuscin has a broad emission spectrum, this dot-looking background and GFP-positive axonal varicosities are easily distinguished from each other. For instance (Figure 1—figure supplement 4), axonal varicosities can be selectively excited in the green channel, whereas the dot-looking lipofuscin background usually appears in both the green and red channels. During quantitative analysis, a machine-learning algorithm was used to reliably segment GFP-labelled axonal fibers, including axonal varicosities, and to remove the lipofuscin background (Arganda-Carreras et al., 2017; Gehrlach et al., 2020).”

      “One recent study compared terminal labelling with a Synaptophysin-EGFP-expressing AAV (which specifically labels synaptic endings) against labelling with a cytoplasmic EGFP AAV (which labels both axon fibers and synaptic endings), and found high correspondence between synaptic and cytoplasmic EGFP signals in target regions (Oh et al., 2014). We therefore quantified GFP-positive pixels (containing signals from both axonal fibers and terminals) rather than the number of synaptic terminals, as done in recent reports (Oh et al., 2014; Gehrlach et al., 2020).”

      Figure 1—figure supplement 4. Difference between axonal varicosities and dot-looking background. STP images (A-D) and high-resolution confocal images (E-H) were acquired in the green and red channels. Synaptic terminals (white arrows) are specifically excited in the green channel, whereas dot-looking background lipofuscin (yellow arrows) is visible in both the green and red channels. (C and G) No colocalization was found between axonal varicosities and the dot-looking background, so axonal varicosities are easily distinguished from the background in the merged image. (D and H) The machine-learning algorithm removed the dot-looking autofluorescence but not the axonal varicosities.

      References:

      Abe H, Tani T, Mashiko H, Kitamura N, Miyakawa N, Mimura K, Sakai K, Suzuki W, Kurotani T, Mizukami H, Watakabe A, Yamamori T, Ichinohe N (2017) 3D reconstruction of brain section images for creating axonal projection maps in marmosets. J Neurosci Methods 286:102-113.

      Arganda-Carreras I, Kaynig V, Rueden C, Eliceiri KW, Schindelin J, Cardona A, Sebastian Seung H (2017) Trainable Weka Segmentation: a machine learning tool for microscopy pixel classification. Bioinformatics 33:2424-2426.

      Gehrlach DA, Weiand C, Gaitanos TN, Cho E, Klein AS, Hennrich AA, Conzelmann KK, Gogolla N (2020) A whole-brain connectivity map of mouse insular cortex. Elife 9.

      Oh SW et al. (2014) A mesoscale connectome of the mouse brain. Nature 508:207-214.
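      The authors separated GFP-positive varicosities from lipofuscin with a trained pixel classifier (Trainable Weka Segmentation). Purely to illustrate the spectral argument above, the Python sketch below swaps in a much simpler, hypothetical two-channel threshold rule: a pixel counts as GFP-positive if it is bright in the green channel but dim in the red channel, while pixels bright in both channels are treated as lipofuscin. The function names and threshold values are assumptions, not the authors' pipeline.

```python
from typing import List

Image = List[List[float]]


def gfp_mask(green: Image, red: Image, green_thr: float, red_thr: float) -> List[List[bool]]:
    """Label a pixel GFP-positive if it is bright in green but dim in red.

    Hypothetical rule (not the trained classifier used in the study):
    lipofuscin has a broad emission spectrum, so it is assumed to be
    bright in both channels and is therefore excluded.
    """
    return [
        [g > green_thr and r <= red_thr for g, r in zip(g_row, r_row)]
        for g_row, r_row in zip(green, red)
    ]


def count_positive(mask: List[List[bool]]) -> int:
    """Number of GFP-positive pixels (fibers plus varicosities)."""
    return sum(sum(row) for row in mask)


# Toy 2x3 images: pixel (0, 0) mimics a GFP varicosity, (0, 2) mimics lipofuscin.
green = [[200.0, 10.0, 220.0],
         [150.0, 5.0, 8.0]]
red = [[12.0, 9.0, 210.0],
       [15.0, 4.0, 6.0]]
mask = gfp_mask(green, red, green_thr=100.0, red_thr=50.0)
print(mask)                  # [[True, False, False], [True, False, False]]
print(count_positive(mask))  # 2
```

      A trained classifier can use texture and neighbourhood features as well as color, so it should be more robust than a fixed threshold. The sketch shows only why a signal present in both channels is separable from one present in green alone.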

      2.1) We apologize for the confusion caused by the insufficient description in the main text. We have now revised the “Materials and Methods” section accordingly. Furthermore, we have made both the whole-brain serial two-photon data and the high-resolution diffusion MRI data freely available to the community, which allows researchers in the field to perform further analyses that we have not done in the current study.

      “AAV was injected into vlPFC in three samples, two of which were imaged with STP. Unfortunately, one sample came loose and fell off the agar block after several weeks of imaging, so its quantitative results are not shown in Figure 3.”

      2.2) We apologize for insufficient description of the precise location of the injection sites. We have revised the description of “Materials and Methods” section and provided a new figure to clarify the exact location of the injection sites.

      “Figures 3-4 and Figure 4—figure supplements 2-4 were derived from sample #8, whose infected area covered areas 45, 12l and 44 of vlPFC. Figure 1—figure supplement 6 was derived from sample #7, whose infected area covered areas 12l and 45 of vlPFC.”

      Figure 1—figure supplement 6. Representative fluorescent images showing the injection site and major tracts of sample #7. (A) STP image of the injection site in vlPFC, shown overlaid with the monkey brain template (left-hand side), mainly spanning areas 12l and 45a. (B) Confocal image of the AAV-infected neurons (white arrows). (C-F) Representative confocal images of major tracts originating from vlPFC.

      2.3) We agree with the reviewer that most tract tracing studies report terminal fields by showing the original labeling patterns. Several recent studies report the total volume of segmented GFP-positive pixels (Oh et al., 2014) or the percentage of total labeled axons (Do et al., 2016; Gehrlach et al., 2020) to represent connectivity strength, and other studies also provide projection density (Hunnicutt et al., 2016). We have provided the percentage of total labeled axons (Figure 3C, right panel), projection density (Figure 3C, left panel) and representative original fluorescent images (Figure 4, Figure 4—figure supplement 2 and Figure 4—figure supplement 4) to present our projection data in several complementary ways.

      References:

      Do JP, Xu M, Lee SH, Chang WC, Zhang S, Chung S, Yung TJ, Fan JL, Miyamichi K, Luo L, Dan Y (2016) Cell type-specific long-range connections of basal forebrain circuit. Elife 5.

      Gehrlach DA, Weiand C, Gaitanos TN, Cho E, Klein AS, Hennrich AA, Conzelmann KK, Gogolla N (2020) A whole-brain connectivity map of mouse insular cortex. Elife 9.

      Hunnicutt BJ, Jongbloets BC, Birdsong WT, Gertz KJ, Zhong H, Mao T (2016) A comprehensive excitatory input map of the striatum reveals novel functional organization. Elife 5.

      Oh SW et al. (2014) A mesoscale connectome of the mouse brain. Nature 508:207-214.
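      The two measures in Figure 3C answer different questions, and a short sketch makes the distinction concrete. The Python code below assumes the definitions commonly used in such studies: projection density is a region's GFP-positive pixels divided by all pixels in that region, and percentage of total labeled axons is a region's GFP-positive pixels divided by the GFP-positive pixels summed over all regions. The exact normalization used in the manuscript may differ, and the region names and pixel counts are hypothetical.

```python
from typing import Dict, Tuple


def projection_metrics(counts: Dict[str, Tuple[int, int]]) -> Dict[str, Tuple[float, float]]:
    """counts maps region -> (gfp_pixels, total_pixels_in_region).

    Returns region -> (projection_density, percent_of_total_labeled).
    Both definitions are assumptions for illustration.
    """
    total_gfp = sum(gfp for gfp, _ in counts.values())
    out = {}
    for region, (gfp, total) in counts.items():
        density = gfp / total if total else 0.0                 # fraction of the region that is labeled
        percent = 100.0 * gfp / total_gfp if total_gfp else 0.0  # share of all labeled pixels
        out[region] = (density, percent)
    return out


# Hypothetical pixel counts for three target regions.
counts = {"striatum": (500, 10_000), "thalamus": (300, 3_000), "CC": (200, 1_000)}
metrics = projection_metrics(counts)
for region, (d, p) in metrics.items():
    print(f"{region}: density={d:.3f}, percent={p:.1f}%")
```

      With these toy counts, CC has the highest projection density (0.2) but the smallest share of labeled pixels (20%), while the striatum shows the opposite pattern. Two measures can therefore rank the same regions differently, which is one reason reporting both can be informative.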

      3) Using the ground-truth from tract tracing to validate tractography results is a timely problem and this study showed promising consistency and discrepancy between the two modalities. Especially, the discrepancy between tracing and tractography data on the IFOF termination brings critical insights into a potential cross-species difference. The finding that IFOF does not reach the occipital cortex provides important support for the speculation that IFOF may not exist in monkeys (for a context of the IFOF debate see Schmahmann & Pandya, 2006, pp 445-446).

      I have minor concerns regarding the statistical robustness of the tracing-tractography comparison. The authors compared the vlPFC-CC-contralateral tract rather than a global connectivity pattern, without justification. Why were other major tracts connecting with vlPFC omitted? In addition, the results are shown for only one monkey, although two monkeys underwent both tracer injection and dMRI scans. It is unclear how the results were chosen or whether the data were averaged.

      We apologize for not describing this clearly. The STP images were acquired in the coronal plane with high x-y resolution (0.95 μm/pixel) but relatively low z resolution (200 μm). Because of this large step size, axonal connection information along the z axis may be lost, which makes it technically demanding to reconstruct axonal density maps in the sagittal or horizontal planes. We therefore focused on the vlPFC-CC-contralateral tract, which travels within the coronal plane, when quantifying similarity coefficients along the anterior-posterior axis of the whole macaque brain, and omitted tracts that appear only as dots in the coronal plane. We have revised the resubmitted manuscript accordingly.

      “GFP projections and probabilistic tracts were plotted together with Dice coefficients and Pearson correlation coefficients (R) along the anterior-posterior axis of the whole macaque brain. Both coefficients were higher in regions of dense projection, especially along the vlPFC-CC-contralateral tract (Figure 6A). As a proof-of-principle investigation, we focused on the vlPFC-CC-contralateral tract, which was reconstructed in 3D space from the STP and dMRI data separately.”
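      To illustrate how the two per-slice similarity measures behave, the Python sketch below computes a Dice coefficient on binarized maps and a Pearson R on the raw intensities for each coronal slice. The common binarization threshold, the flattened-slice representation and all values are assumptions for illustration. The response does not state how the GFP and tractography maps were thresholded or registered.

```python
import math
from typing import List, Sequence, Tuple


def dice(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Dice coefficient 2|A∩B| / (|A| + |B|) of two binary masks."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    size = sum(a) + sum(b)
    return 2.0 * inter / size if size else 0.0


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two intensity vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy) if sx and sy else 0.0


def slice_similarity(gfp: List[List[float]], tract: List[List[float]],
                     thr: float = 0.5) -> List[Tuple[float, float]]:
    """Per coronal slice (flattened to 1D), return (Dice, Pearson R).

    Binarizing both maps at the same threshold `thr` is an assumption.
    """
    results = []
    for g, t in zip(gfp, tract):
        d = dice([v > thr for v in g], [v > thr for v in t])
        results.append((d, pearson(g, t)))
    return results


# Two hypothetical flattened slices along the anterior-posterior axis.
gfp = [[0.0, 0.8, 0.9, 0.1], [0.9, 0.0, 0.0, 0.0]]
tract = [[0.0, 0.6, 0.7, 0.4], [0.0, 0.0, 0.0, 0.8]]
res = slice_similarity(gfp, tract)
for i, (d, r) in enumerate(res):
    print(f"slice {i}: Dice={d:.2f}, R={r:.2f}")
```

      Dice rewards matching suprathreshold territory, while R also reflects graded intensity agreement, so the two curves in Figure 6A need not rise and fall together.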

      With regard to the presentation of the dMRI data, we apologize for not making this clear in the previous version. We have revised Figures 6 and 7 so that the dMRI scans from different macaque monkeys are shown separately.

      Figure 6. Comparison of vlPFC connectivity profiles by STP tomography and diffusion tractography. (A) Percentage of projection, probabilistic tracts, Dice coefficients and Pearson coefficients (R) were plotted along the anterior-posterior axis in the macaque brain. Blue and red colors indicate results of two dMRI data sets acquired from different macaque monkeys. (B, C) 3D visualization of the fiber tracts running from the injection site in vlPFC through the corpus callosum to the contralateral vlPFC, by STP tomography and diffusion tractography. (D-F) Representative coronal slices of the diffusion tractography map and the axonal density map along the vlPFC-CC-contralateral tract, overlaid with the corresponding anatomical MR images. (G-J) Magnified views of the GFP-labeled axons marked in Figure 6F. (H, J) High-magnification images of the white boxes indicated in G and I, both of which reveal fine details of axonal morphology.

      Figure 7. Illustration of the inferior fronto-occipital fasciculus by diffusion tractography and STP. (A) The fiber tractography of IFOF (lateral view). Two inclusion ROIs at the external capsule (pink) and the anterior border of the occipital lobe (purple) were used and shown on the coronal plane. The IFOF stems from the frontal lobe, travels along the lateral border of the caudate nucleus and external/extreme capsule, forms a bowtie-like pattern and anchors into the occipital lobe. (B) The reconstructed traveling course of IFOF based on vlPFC projectome was shown in 3D space. (C) The Szymkiewicz-Simpson overlap coefficients between 2D coronal brain slices of the dMRI-derived IFOF tract and vlPFC projections were plotted along the anterior-posterior axis of the macaque brain. Blue and red colors indicate results of two dMRI data sets acquired from different macaque monkeys. Four cross-sectional slices (D-G) along the IFOF tracts were arbitrarily chosen to demonstrate the spatial correspondence between the diffusion tractography and axonal tracing of STP images. (D-G) The detected GFP signals (green) of vlPFC projectome and the IFOF tracts (red) obtained by diffusion tractography were overlaid on anatomical MRI images, with a magnified view of the box area. Evidently there was no fluorescent signal detected in the superior temporal area where the dMRI-derived IFOF tract passes through (G).
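      The Szymkiewicz-Simpson overlap coefficient used in Figure 7C divides the intersection by the size of the smaller set, so it reaches 1 whenever one map lies entirely inside the other. A plausible motivation, which the response does not state, is that a thin axonal projection is expected to occupy only part of a broader tractography volume. The Python sketch below contrasts it with Dice on hypothetical pixel sets. It also shows that a slice with no GFP signal scores 0, as in the situation described for panel G.

```python
from typing import Set, Tuple

Pixel = Tuple[int, int]  # (row, column) within one coronal slice


def overlap_coefficient(a: Set[Pixel], b: Set[Pixel]) -> float:
    """Szymkiewicz-Simpson coefficient |A ∩ B| / min(|A|, |B|)."""
    if not a or not b:
        return 0.0  # an empty map (e.g. no GFP signal) gives no overlap
    return len(a & b) / min(len(a), len(b))


def dice_sets(a: Set[Pixel], b: Set[Pixel]) -> float:
    """Dice coefficient on the same sets, for comparison."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0


tract = {(0, 0), (0, 1), (1, 0), (1, 1)}  # hypothetical dMRI tract pixels
gfp = {(0, 0), (0, 1)}                     # hypothetical GFP pixels, a subset of the tract
print(overlap_coefficient(gfp, tract))     # 1.0
print(round(dice_sets(gfp, tract), 3))     # 0.667
print(overlap_coefficient(set(), tract))   # 0.0
```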

    1. Author Response

      Reviewer #1 (Public Review):

      This is a well-executed study using cutting-edge proteomics analysis to characterize muscle tissue from a genetically diverse mouse population. The use of only females in the study is a serious limitation that the authors acknowledge. The statistical methods, including protein quantification, QTL mapping, and trait correlation analysis are appropriate and include corrections for multiple testing. One concern is that missense variants, if they occur in peptides used to quantify proteins, could lead to false-positive signatures of low abundance (see lines 123-127). The experimental validation and deep dive into UFMylation provide some confidence in the reliability of other associations that can be mined from these data. The authors have provided a web-based tool for exploring the data.

      We thank the reviewer for these very positive comments and for reviewing the manuscript.

      We agree the quantification of peptides containing missense variants could confound quantification at the protein level. This is an important consideration when there are only a few peptides identified for a specific protein. However, in our data the average number of peptides used to quantify the 14 proteins containing missense-associated pQTLs was ~68 peptides/protein (lowest was 5 peptides for FGB and highest 703 peptides for NEB).

      In the case of EPHX1, we quantified 15 peptides (Figure R1A), including a peptide spanning amino acids 339-347, immediately adjacent to R338. Because trypsin cleaves after arginine, the R338C mutation would abolish cleavage at this site, so the missense peptide would not be identified; as the reviewer suggests, this could produce a false-positive signature of low abundance. To investigate this, we re-quantified the relative abundance of EPHX1 for each genotype with and without the peptide spanning 339-347 (Figure R1B). Excluding this peptide made little difference to protein quantification, and EPHX1 abundance remained significantly lower in the R338C (AA) genotype. In fact, peptide-level quantification revealed that 12 of the remaining 14 peptides were also significantly lower in the AA genotype (data not shown).

      Although we agree this is a very important consideration, we are mindful of the length of the article and feel that including these data would not significantly improve the manuscript. We therefore request not to include them, as they would detract from the main findings of the paper, which focus on phenotypic associations and the validation of UFMylation as a regulator of muscle function.

      Figure R1. (A) Identified peptides from EPHX1 mapped onto the primary amino acid sequence, highlighting the missense mutation induced by SNP rs32746574, which was associated with EPHX1 protein levels by pQTL analysis. (B) Relative quantification of EPHX1 between the two genotypes of SNP rs32746574, with and without the peptide neighboring the missense mutation (amino acids 339-347) (**p<0.001, Student's t-test).
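      The sensitivity check described above can be sketched as follows: re-quantify a protein from its peptides with and without one peptide, then compare genotypes with a two-sample Student's t statistic. Summarizing peptides by their median log2 abundance is an assumption for illustration, since the response does not state the aggregation method. All values are made up, and the sketch stops at the t statistic rather than computing a p-value.

```python
import math
import statistics
from typing import Dict, List, Optional


def protein_abundance(peptides: Dict[str, float], exclude: Optional[str] = None) -> float:
    """Median of peptide-level log2 abundances, optionally dropping one peptide.

    Median aggregation is an assumption, not the authors' stated method.
    """
    values = [v for name, v in peptides.items() if name != exclude]
    return statistics.median(values)


def student_t(a: List[float], b: List[float]) -> float:
    """Two-sample Student's t statistic with pooled variance (no p-value)."""
    n1, n2 = len(a), len(b)
    pooled = ((n1 - 1) * statistics.variance(a)
              + (n2 - 1) * statistics.variance(b)) / (n1 + n2 - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled * (1 / n1 + 1 / n2))


# Hypothetical log2 peptide abundances per mouse; "339-347" is the peptide next to R338.
gg = [{"p1": 1.0, "p2": 1.2, "p3": 0.9, "339-347": 1.1},
      {"p1": 1.1, "p2": 1.3, "p3": 1.0, "339-347": 1.2},
      {"p1": 0.9, "p2": 1.0, "p3": 0.8, "339-347": 1.0}]
aa = [{"p1": -0.5, "p2": -0.4, "p3": -0.6, "339-347": -2.0},
      {"p1": -0.3, "p2": -0.5, "p3": -0.4, "339-347": -2.2},
      {"p1": -0.6, "p2": -0.2, "p3": -0.5, "339-347": -1.9}]

for excl in (None, "339-347"):
    t = student_t([protein_abundance(m, excl) for m in gg],
                  [protein_abundance(m, excl) for m in aa])
    print(f"exclude={excl}: t={t:.2f}")
```

      A robust summary such as the median limits the influence of any single peptide. This matches the intuition behind the check: if the genotype difference persists after dropping the suspect peptide, that peptide is not driving the result.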

    1. Author Response

      Reviewer #1 (Public Review):

      Building upon the previous evidence of activation of auditory cortex VIP interneurons in response to non-classical stimuli like reward and punishment, Szadai et al., extended the investigation to multiple cortical regions. Use of three-dimensional acousto-optical two-photon microscopy along with the 3D chessboard scanning method allowed high-speed signal acquisition from numerous VIP interneurons in a large brain volume. Additionally, activity of VIP interneurons in deep cortical regions was obtained using fiber photometry. With the help of these two imaging methods authors were able to extract and analyze the VIP cell signal from different cortical regions. Study of VIP interneuron activity during an auditory go-no-go task revealed that more than half of recorded cortical VIP interneurons were responding to both reward and punishment with high reliability. Fiber photometry data revealed similar observations; however, the temporal dynamics of reinforcement stimuli-related response in mPFC was slower than in the auditory cortex. The authors performed detailed analysis of individual cell activity dynamics, which revealed five categories of VIP cells based on their temporal profiles. Further, animals with higher performance on the discrimination task showed stronger VIP responses to 'go trials' possibly suggesting the role of VIP interneurons in discrimination learning. Authors found that reinforcement related response of VIP interneurons in visual cortex was not correlated with their sensory tuning, unveiling an interesting idea that VIP interneurons take part in both local as well as global processing. These observations bring attention to the possible involvement of VIP interneurons in reinforcement stimuli-associated global signaling that would regulate local connectivity and information processing leading to learning.

      The state-of-the-art imaging technique allowed authors to succeed in imaging VIP interneurons from several cortical regions. Advanced analyses revealed the nuances, similarities and differences in the VIP activity trend in various regions. The conclusions about reinforcement stimuli related activity of VIP interneurons made by the authors are well supported by the results obtained, however some claims and interpretations require more attention and clarification.

      We thank Reviewer #1 for the positive general comments.

      Reviewer #2 (Public Review):

      In recent years the activity of cortical VIP+ interneurons in relation to learning and sensory processing has raised great interest and has been intensely investigated. The ability of VIP+ interneurons in the auditory cortex to respond to both reward and punishment was already reported a few years ago by some of the authors (Pi et al., 2013, Nature). However, this work importantly adds to their previous study demonstrating a largely similar and synchronous response of a large fraction of these interneurons across the neocortex to salient stimuli of different valence during the performance of an auditory discrimination task.

      An additional strength of this study is the analysis and identification of the general pattern of VIP+ interneuron responses associated with specific behaviors across the different layers of the neocortex.

      Interestingly, the authors also identified using cluster analysis 5 different classes of VIP+ interneurons, based on the dynamic of their responses, that were unequally distributed in distinct cortical areas.

      This is a well performed study that took advantage of a cutting-edge imaging approach with high recording speed and good signal-to-noise ratio. Experiments are well performed and the data are properly analyzed and nicely illustrated. However, one shortcoming of this paper, in my opinion, is the "case report" structure of the data. Essentially for each neocortical area the activity of VIP+ interneurons was analyzed only in one animal. This limits the assessment of the stability of the response/recruitment of these interneurons. I appreciate the high number of recorded VIP+ interneurons per area/animal and I do understand that it would be excessively laborious to perform 3D random-access two-photon microscopy in several mice for each cortical area. On the other hand, it would be important to have some knowledge of the general variability of the responses of these neurons among animals.

      In conclusion, despite the findings described in this manuscript being generally sound, additional experiments are recommended to further substantiate the conclusions.

      Thank you for pointing out this potential misunderstanding. Although we did state the number of animals from which recordings were obtained (n=22 in total), we now repeat this information at several points in the manuscript to avoid confusion. The data recorded with the two-photon microscope come from 16 animals, and fiber photometry was performed in a separate group of 6 animals. Each animal was recorded in one area (14 mice) or two areas (8 mice: 2 AOD, 6 photometry). We aimed to acquire at least 3 recordings per area (4 in the primary somatosensory cortex, 6 in the primary and secondary motor cortices, 4 in the lateral and medial parietal cortices, 3 in the primary visual cortex, 6 in the auditory and medial prefrontal cortices). In the revised manuscript, this information can be found at the beginning of the Results section and in the figure legends:

      “To probe the behavioral function of VIP interneurons, we trained head-fixed mice (n=22 in total, n=16 for 2-photon microscopy and n=6 for fiber photometry) on a simple auditory discrimination task (Figure 1A).”

      “Among the 811 neurons imaged in 18 imaging sessions from 16 mice,”

      “Ca2+ responses of individual VIP interneurons recorded separately from 18 different cortical regions from 16 mice using fast 3D AO imaging were averaged for Hit (thick green), FA (thick red), Miss (dark blue), and CR (light blue). Fiber photometry data were recorded simultaneously from mPFC and ACx regions and are shown in gray boxes. Functional map (Kirkcaldie, 2012) used with the permission of the author. Speaker symbols represent the average time of tone onset, and gray triangles mark the reinforcement onset for Hit and FA. Averages of Miss and CR trials were aligned according to the expected reinforcement delivery calculated on the basis of the average reaction time. mPFC: medial prefrontal cortex (n=6 mice), ACx: auditory cortex (n=6), S1Hl/S1Tr/S1Bf/S1Sh: primary somatosensory cortex, hindlimb/trunk/barrel field/shoulder region (n=4), M1/M2: primary/secondary motor cortex (n=6), Mpta/Lpta: medial/lateral parietal cortex (n=4), V1: primary visual cortex (n=3).”

      “This approach allowed us to simultaneously measure bulk calcium-dependent signals from VIP interneurons located in the right medial prefrontal (mPFC) and left auditory cortices (ACx) by implanting two 400 µm optical fibers at these locations (n=6 sessions from n=6 mice, Figure 1–figure supplement 1C).”

      “Raster plot of the trial-to-trial activation of the responsive VIP neurons in Hit and FA trials during the two-photon imaging sessions (n=18 sessions, n=16 mice, n=746 cells).”

      Subregional labels, for example in Figure 2, should be considered additional information to orient readers, even though they were precisely defined on the basis of coordinates. All analyses of regional differences were conducted at the level of the main functional areas of the dorsal cortex (motor, somatosensory, parietal, and visual). Despite some location-dependent heterogeneity in the late response phase (Figures 2G and H), these main dorsal cortical regions were all similar in their responsiveness to reinforcers and auditory cues.

      Reviewer #3 (Public Review):

      In this study Szadai et al. show reliable, relatively synchronous activation of VIP neurons across different areas of dorsal cortex in response to reward and punishment of mice performing an auditory discrimination task. The authors use both a relatively fast 2 photon imaging, as well as fiber photometry for some deeper areas. They cluster neurons according to their temporal response profiles and show that these profiles differ across areas and cortical depths. Task performance, running behavior and arousal are all related to VIP response magnitude, as has been previously shown.

      Methodologically, this paper is strong: the described imaging technique allows for fairly fast sampling rates, they sample VIP cells from many different areas and the analyses are sophisticated and touch on the most relevant points. The figures are of high quality.

      However, as the manuscript is now, the presentation could be clearer, the methods more complete and it is not clear whether their conclusions are entirely supported by the data.

      The main issue is that reinforcement and arousal are hard to distinguish in this study. It is well known that VIP activity is correlated with arousal. And it is fairly clear that the reinforcement they use in this study - air puffs to the eye, as well as water rewards - cause arousal. It is possible that the reinforcer responses they observe in VIP neurons throughout all areas merely reflect the increases in arousal caused by these behaviorally salient events. They do discuss this caveat (albeit not fully convincingly) and in their abstract even state that the arousal state was not predictive of reinforcer responses. However their data clearly shows the tight relationship of the VIP reinforcer responses to both arousal (as measured by pupil diameter), as well as running speed of the animal. Both of these variables are well known to be tightly coupled to VIP activity.

      Although barely mentioned, the authors do appear to sometimes present uncued reward (Figure S2F). If responses were noticeably different from the same events in the task context (as actual reinforcers) this could at least hint towards the reinforcement signal being distinct from mere arousal. However, this data is only mentioned in one supplementary figure in a different context (comparison with PV cells) and neither directly compared to cued reward, nor is this discussed at all. Were uncued air puffs also presented? How do the responses compare to cued air puffs/punishment?

      Our original approach to distinguish between reinforcement- and arousal-related responses aimed:

      1) to show that VIP cells with both low and high correlation coefficients with arousal produce large signals upon reinforcement presentation (Figure 3B),

      2) to show that the large difference between low and high arousal changes was reflected only to a limited extent in VIP activity (Figures 3C and D). As highlighted in Figure R1, where we also added bars showing ∆P/P in the high and low pupil-change conditions, the difference in ∆P/P is ~5-fold, whereas it is only ~1.5-fold for ∆F/F. This disproportion suggests that a large part of the signal below the dashed blue line is independent of arousal. We have added these modifications to the new version of Figure 3 for clarity.

      Figure R1 = Figure 3C-D with modification. Comparison of pupil changes and corresponding calcium averages.

      We collected further evidence to support our claims. In Figure 3–figure supplement 2, we show Hit and FA trials in which the reinforcement did not elevate the arousal level any further. Many of these trials were associated with locomotion prior to the reinforcement, but it was also common for the animals to remain still during the whole trial. Trials with increased locomotion upon reinforcement presentation were excluded. Reinforcement-related calcium signals were still present under these conditions, indicating that these signals are not simple reflections of arousal. Moreover, in Figure 3–figure supplement 2D we systematically estimated the distinct contributions of arousal, locomotion, and reinforcers with a generalized linear model, which also supported our view of reinforcement-related coding.

      We now say in the results:

      “Finally, to assess the motor- and reinforcement-related contributions to VIP interneuronal activity, we built a generalized linear model using the behavior and imaging data of the SS and Mtr recordings (Figure 3–figure supplement 2D, n=3 mice). This model was able to explain 18.8 ± 11.1% of the variance of the VIP population calcium signal, and highlighted that arousal was the best predictor, followed by reward, punishment, locomotion velocity, and auditory cue (weights = 0.055, 0.031, 0.028, 0.020, 0.018 respectively; all predictors, except the auditory cue in the case of one animal, contributed significantly, p<0.001). These observations indicate that running and arousal changes alone cannot fully explain the recruitment of VIP interneurons by reinforcers.”
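      For readers less familiar with this kind of analysis, the Python sketch below shows one common way to obtain predictor weights and a fraction of variance explained: a Gaussian GLM with an identity link, which is equivalent to ordinary least squares, fitted on z-scored predictors. The link function, the predictor scaling and all data here are assumptions. The response does not specify how the model was configured or how significance was assessed, and the synthetic weights are arbitrary, not the values reported above.

```python
import numpy as np


def fit_linear_glm(X: np.ndarray, y: np.ndarray):
    """Gaussian GLM with identity link, i.e. ordinary least squares.

    X: (n_samples, n_predictors), y: (n_samples,). Predictors are z-scored so
    that weights are comparable. Returns (weights, intercept, variance explained).
    """
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    design = np.column_stack([Xz, np.ones(len(y))])  # last column = intercept
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef
    r2 = 1.0 - residual.var() / y.var()              # fraction of variance explained
    return coef[:-1], coef[-1], r2


# Synthetic example with made-up weights (not the study's results).
names = ["arousal", "reward", "punishment", "velocity", "cue"]
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, len(names)))
true_w = np.array([0.5, 0.3, 0.3, 0.2, 0.1])
y = X @ true_w + rng.normal(scale=1.0, size=2000)
w, b, r2 = fit_linear_glm(X, y)
for name, weight in zip(names, w):
    print(f"{name}: {weight:.3f}")
print(f"variance explained: {100 * r2:.1f}%")
```

      Z-scoring lets weights be ranked across predictors with different units (pupil size, running speed, event indicators), which is what an ordering such as "arousal, then reward, then punishment" relies on.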

      We apologize for not describing the rationale for, and the results of, the uncued reward experiments. Briefly, while recording reinforcement-related signals in the auditory cortex during our task, we realized that cue delivery, and the resulting purely sensory response, could alter the measurement of reward-related responses. To disentangle reward-related from sensory responses, we therefore presented the animals with simple, uncued rewards and observed a similar and robust recruitment of VIP interneurons. Following the same rationale, we made similar measurements for PV neurons.

      We now say in the results:

      “We did not further analyze the FA responses in the auditory cortex, as those responses also had a sensory component linked to the white noise-like sound created by the air puff delivery. Because cue delivery could confound the measurement of reward-mediated responses from VIP interneurons in the auditory cortex (see also Methods), we delivered random rewards in separate sessions. Water droplet delivery recruited VIP interneurons in both the auditory and medial prefrontal cortex in a similar fashion to water delivery during the discrimination task (Figure 2–figure supplement 1G). Consistent with our single-cell results, the PV-expressing neuronal population in ACx did not show any significant change in activity upon similar random reward delivery (Figure 2–figure supplement 1G).”

      Regarding the difference between cued and uncued responses, we agree with the reviewer that this is an important point. The goal of this manuscript, however, is to study how reward and punishment are represented by VIP interneurons in cortex.

      The imaging method appears well suited for their task; however, the improvements listed in Table S1 make the method appear far superior to existing methods in many aspects. Published or preprinted papers with 2-photon imaging of VIP populations (e.g. from the Scanziani lab (Keller et al.), Carandini lab (Dipoppa et al.), deVries lab (Millman et al.), and Adesnik lab (Mossing et al.)), which use the much more common resonant scanning, seem to be able to image 4-7 layers at 4-8 Hz with a good enough SNR and a potentially bigger neuronal yield of approximately 100-200 VIP cells, depending on the field of view. While not every single cell in a volume would be captured by these studies, the only major advantage of the technique used here appears to be its superior temporal resolution.

      We thank the reviewer for the positive comment, and we agree that the interpretation must be improved. We agree that the imaging methods in the papers listed above have good SNR and were appropriate for addressing the scientific questions they posed. As the reviewer points out, 3D-AOD imaging allows fast 3D measurements that cannot be achieved otherwise. We used these advantages to address the critical question of layer specificity in the response of VIP interneurons to reinforcer presentation (Figure 2–figure supplement 1F, but see also the new Figure 1–figure supplement 1B). For a comparison and quantification of the advantages of AOD microscopy over other imaging methods, the reviewer and readers can refer to the Methods section (3D AO microscopy), Table S1 and Szalay et al., 2016. We agree with the reviewer that one of the main advantages is the superior temporal resolution. The second main advantage is the improved SNR. This originates from the fact that the entire measurement time is spent on regions of interest; unnecessary background areas do not need to be measured. More specifically, even in the case of 2D imaging, SNR is improved by a factor of:

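      $$\text{SNR gain} = \left(\frac{A_{\text{frame}}}{A_{\text{VIP}}}\right)^{1/2},$$

      where $A_{\text{frame}}$ is the area of the entire frame and $A_{\text{VIP}}$ is the area of the recorded VIP cells. This factor is about $100^{1/2} = 10$, as VIP interneurons represent about 1% of the brain. We used this second advantage of AO scanning when we determined the activation ratio (e.g., see Figure 2D).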
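      The square-root factor can be checked numerically. The sketch below is illustrative only and assumes shot-noise-limited detection, so that SNR grows with the square root of the dwell time spent on each ROI, which in turn scales with the frame-to-ROI area ratio; the function name `snr_gain` is hypothetical and not part of any published AO toolbox.

```python
import math


def snr_gain(frame_area: float, roi_area: float) -> float:
    """SNR improvement from spending all dwell time on ROIs instead of the full frame.

    Assumes shot-noise-limited detection: SNR scales with the square root of the
    dwell time per ROI, which scales with frame_area / roi_area.
    """
    if roi_area <= 0 or roi_area > frame_area:
        raise ValueError("roi_area must be positive and no larger than frame_area")
    return math.sqrt(frame_area / roi_area)


# VIP somata covering about 1% of the frame, as in the estimate above.
print(snr_gain(frame_area=100.0, roi_area=1.0))  # 10.0
```

      Only the area ratio matters, so the absolute units of the two areas are irrelevant as long as they match.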

      As resolving single or a few action potentials is challenging in behaving mice labelled with the GCaMP6 sensor, any improvement in SNR lowers the detection threshold. The higher SNR achieved here lowered the detection threshold, which also explains the relatively high activation ratio in our work.

      In the case of asynchronous activity patterns, individual small neuropil structures contribute negligibly to somatic activity because the volume of a soma is large relative to that of any given small neuropil structure; this minimizes the error in the ∆F/F calculation of somatic responses. However, reinforcement, arousal, and running can generate highly synchronous neuronal activity, which can synchronize neuropil activity around a given soma and therefore effectively and systematically modulate the somatic ∆F/F responses. To avoid this error, we used a high-NA objective with adequate neuropil resolution and combined it with motion correction. The use of the high-NA objective also decreased the total scanning volume to about 689 µm × 639 µm × 580 µm and therefore limited the maximum number of VIP cells that could be recorded. It is also possible to use a low-NA objective with a much larger FOV and scanning volume and record over 1000 VIP cells, but the extent of the PSF along the z dimension is inversely proportional to the square of the objective NA, so neuropil resolution would be at least partially lost. In summary, using the high-NA Olympus objective, we maximized the 2P resolution, which, in combination with off-line motion artifact elimination, allowed precise recording of somatic signals without neuropil contamination; this provided correct activation ratio values.
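      To make the NA trade-off concrete, the sketch below applies the simplified scaling stated in the text, axial PSF extent proportional to 1/NA². The NA values are hypothetical examples, not the specifications of the objective used in the study, and exact two-photon PSF models include refractive-index terms that this sketch ignores.

```python
def relative_axial_extent(na_high: float, na_low: float) -> float:
    """Axial PSF extent at na_low relative to na_high.

    Uses the simplified scaling z_extent ∝ 1 / NA**2; refractive-index
    corrections of full two-photon PSF models are deliberately ignored.
    """
    if na_high <= 0 or na_low <= 0:
        raise ValueError("numerical apertures must be positive")
    return (na_high / na_low) ** 2


# Hypothetical objectives: going from NA 1.0 to NA 0.5 stretches the PSF ~4x in z.
print(relative_axial_extent(1.0, 0.5))  # 4.0
```

      Under this scaling, halving the NA quadruples the axial extent, which is why a wider field of view bought with a lower NA comes at the cost of neuropil resolution.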

      Even though this is not mentioned at all, it certainly appears possible that the acousto-optical scanning emits audible noise. If so, it would be good to know the frequency range and level of this background noise, whether there are auditory responses to the scanning itself, and whether it interferes in any way with the performance of the animals in the auditory task. If this is not the case, this should probably simply be mentioned for non-experts.

      While the name “acousto-optical deflector” may suggest acoustic noise, these devices are driven in the 55-120 MHz range, about three orders of magnitude above the upper limit of hearing in mice: mice cannot hear them. Moreover, we developed water-cooled AODs ten years ago, which means that fans are also not required; therefore, AOD-based scanning can be used with zero noise emission. In contrast, galvo, resonant, and piezo scanners work in the kHz frequency range, which is in the middle of the hearing range of mice. Moreover, these scanners cannot be operated in a vacuum and are just a few tens of centimeters away from the mice, which means that their acoustic noise cannot be canceled but can only be partially suppressed with white noise. We thank the reviewer for the helpful comment and have added one sentence about the absence of acoustic noise during acousto-optical scanning:
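      The "orders of magnitude" argument is a simple log ratio. The sketch below assumes an upper hearing limit of roughly 100 kHz for mice; that figure is an assumption added for illustration, not a value given in the response. Under it, the AOD drive band sits about 2.7 to 3.1 decades above the limit.

```python
import math

AOD_DRIVE_RANGE_HZ = (55e6, 120e6)  # AOD drive range quoted in the response
MOUSE_UPPER_HEARING_HZ = 100e3      # assumption: approximate upper limit of mouse hearing


def decades_above(freq_hz: float, limit_hz: float) -> float:
    """How many orders of magnitude (log10 ratio) freq_hz lies above limit_hz."""
    return math.log10(freq_hz / limit_hz)


for f in AOD_DRIVE_RANGE_HZ:
    gap = decades_above(f, MOUSE_UPPER_HEARING_HZ)
    print(f"{f / 1e6:.0f} MHz is {gap:.2f} decades above the assumed limit")
```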

      “The deflectors are driven in the 55-120 MHz frequency range, therefore the noise emitted does not interfere with the auditory cues, as mice can’t hear it. This, in combination with the water cooling of the deflectors, makes the AOD-based scanning the quietest technology for in-vivo imaging.”

      The authors show a strong correlation between task performance (hit rate) and the response to the auditory cue on hit trials. Were there any other significant correlations of VIP cells' responses with other trial types? Was the reinforcer response correlated with any behavioral variables?

      We have not found any remarkable correlations between VIP cell activity and behavioral variables except the one mentioned above.

      For example, we tested the correlation of the discrimination rate (hit rate/FA rate) with ∆F/Ftone in Hit trials, but it was not significant (R²=0.03, F=0.49, p=0.69), just like Hit rate vs. ∆F/Ftone in FA trials (R²=0.19, F=3.8, p=0.07) and discrimination rate vs. ∆F/Ftone in FA trials (R²=0.07, F=1.1, p=0.31).
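      For readers who want to reproduce this kind of test, the sketch below fits a single-predictor linear regression and returns R² and the F statistic, using the identity F = (n − 2)·R²/(1 − R²) on (1, n − 2) degrees of freedom. It is a generic illustration with hypothetical data, not the authors' analysis code; a p-value would additionally require the F distribution (for example, `scipy.stats.f.sf`).

```python
def simple_regression_stats(x, y):
    """Fit y = intercept + slope * x and return (slope, intercept, r_squared, f_stat).

    With one predictor, F = (n - 2) * R^2 / (1 - R^2) on (1, n - 2) degrees of freedom.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must both vary")
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = sxy ** 2 / (sxx * syy)
    f_stat = float("inf") if r_squared == 1 else (n - 2) * r_squared / (1 - r_squared)
    return slope, intercept, r_squared, f_stat


# Hypothetical paired values (e.g. a behavioral score vs. a tone-evoked response).
x_vals = [1, 2, 3, 4, 5]
y_vals = [2, 4, 5, 4, 5]
print(simple_regression_stats(x_vals, y_vals))
```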

    1. Author Response

      Reviewer #2 (Public Review):

      The manuscript by Carrasquilla and colleagues applied Mendelian Randomization (MR) techniques to study causal relationship of physical activity and obesity. Their results support the causal effects of physical activity on obesity, and bi-directional causal effects of sedentary time and obesity. One strength of this work is the use of CAUSE, a recently developed MR method that is robust to common violations of MR assumptions. The conclusion reached could potentially have a large impact on an important public health problem.

      Major comments:

      (1) While the effect of physical activity on obesity is in line with earlier studies, the finding that BMI has a causal effect on sedentary time is somewhat unexpected. In particular, the authors found this effect only with CAUSE, while the evidence from other MR methods does not reach the statistical significance cutoff. The strength of CAUSE lies more in the control of false positives than in high power. In general, the power of CAUSE is lower than that of the simple IVW method. This is also the case in this setting, with high power for the exposure (BMI) but lower power for the outcome (sedentary time); see Fig. 2B of the CAUSE paper.
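      As a reference point for the CAUSE versus IVW comparison, the sketch below implements the standard first-order, fixed-effect IVW estimator: each variant's Wald ratio (outcome effect divided by exposure effect) is weighted by the squared exposure effect over the outcome variance. It is not CAUSE, which instead compares a causal model against a shared-factor model; the summary statistics in the usage example are hypothetical.

```python
def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """First-order, fixed-effect inverse-variance weighted (IVW) MR estimate.

    Each variant contributes a Wald ratio beta_outcome / beta_exposure weighted
    by beta_exposure**2 / se_outcome**2. Returns (estimate, standard_error).
    """
    if not beta_exposure or not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("inputs must be non-empty and of equal length")
    num = sum(bx * by / s ** 2 for bx, by, s in zip(beta_exposure, beta_outcome, se_outcome))
    den = sum(bx ** 2 / s ** 2 for bx, s in zip(beta_exposure, se_outcome))
    return num / den, (1.0 / den) ** 0.5


# Hypothetical summary statistics for three variants (exposure = BMI, outcome = sedentary time).
bx = [0.10, 0.20, 0.05]
by = [0.05, 0.10, 0.025]
se = [0.01, 0.02, 0.01]
print(ivw_estimate(bx, by, se))
```

      Because every variant here has an outcome effect of exactly half its exposure effect, the estimate is 0.5; real data scatter around the fitted slope, and outlying variants can pull an IVW estimate strongly.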

      This does not necessarily mean that the results are wrong. It is possible, for example, that by better modeling pleiotropic effects, CAUSE better captures the causal effects and has higher power. Nevertheless, it would be helpful to better understand why CAUSE gives high statistical significance while the other methods do not. Two suggestions here:

      (a) It is useful to visualize the MR analysis with scatter plot of the effect sizes of variants on the exposure (BMI) and outcome (sedentary time). In the plot, the variants can be colored by their contribution to the CAUSE statistics, see Fig. 4 of the CAUSE paper. This plot would help show, for example, whether there are outlier variants; or whether the results are largely driven by just a small number of variants.

      We agree and have now added a scatter plot of the expected log pointwise posterior density (ELPD) contributions of each variant to BMI and sedentary time, and the contributions of the variants to selecting either the causal model or the shared model (Figure 2-figure supplement 1 panel A). We identified one clear outlier variant (red circle) that we thus decided to remove before re-running the CAUSE analysis (panel B). We found that the causal effect of BMI on sedentary time remained of similar magnitude before and after the removal of this outlier variant (beta=0.13, P=6×10⁻⁴ and beta=0.13, P=3×10⁻⁵, respectively) (Supplementary File 1 and 2).

      We have added a paragraph in the Results section to describe these new findings:

      Lines 204-210: “We checked for outlier variants by producing a scatter plot of expected log pointwise posterior density (ELPD) contributions of the variants to BMI and sedentary time (Supplementary File 1), identifying one clear outlier variant (rs6567160 in MC4R gene) (Figure 2, Appendix 1—figure 2). However, the causal effect of BMI on sedentary time remained consistent even after removing this outlier variant from the CAUSE analysis (Supplementary File 1 and 2).”

      (b) CAUSE is susceptible to false positives when the value of q, a measure of the proportion of shared variants, is high. The authors stated that q is about 0.2, which is pretty small. However, it is unclear if this is q under the causal model or the sharing model. If q is small under the sharing model, the result would be quite convincing. This needs to be clarified.

      We thank the reviewer for a very relevant question. We have now clarified in the manuscript that all of the reported q values (~0.2) were under the causal model (lines 202-203). We applied strict parameters for the priors in CAUSE in all of our analyses, which leads to high shared-model q values (q=0.7-0.9). To examine whether our bidirectional causal findings for BMI and sedentary time may represent false positive results, we performed a further analysis to identify and exclude outlier variants, as described in our response to Question 7. That is, we produced a scatter plot of expected log pointwise posterior density (ELPD) contributions of each variant to BMI and sedentary time, and the contributions of the variants to selecting either the causal model or the shared model (Supplementary Figure 2 panel A, shown above). We identified one clear outlier variant (red circle) that we thus removed (panel B), but the magnitude of the causal estimates was not affected by the exclusion of the variant (Supplementary File 1 and 2).

      (2) Given the concern above, it may be helpful to strengthen the results using an additional strategy. Note that the biggest worry with the BMI-sedentary time relationship is that the two traits are both affected by an unobserved heritable factor. This hidden factor likely affects some behavioral component, so it most likely acts through the brain. On the other hand, BMI may involve multiple tissue types, e.g. adipose tissue. So the idea is: suppose we can partition BMI variants by tissue, e.g. those acting via the brain or via adipose tissue; then we can test MR using only the BMI variants acting in a given tissue. If there is a causal effect of BMI on sedentary time, we expect to see similar results from MR with different tissues. If the two traits are affected by the hidden factor, then the MR analysis using BMI variants acting in adipose tissue would not show significant results.

      While I think this strategy is feasible conceptually, I realize that it may be difficult to implement. BMI heritability was found to be primarily enriched in brain regulatory elements [PMID:29632380], so even if there are other tissue components, their contribution may be small. One paper does report that BMI is enriched in CD19 cells [PMID: 28892062], though. A second challenge is to figure out the tissue of origin of GWAS variants. This probably requires fine-mapping analysis to pinpoint causal variants and overlap with tissue-specific enhancer maps, which is not a small task. So I'd strongly encourage the authors to pursue some analysis along this line, but it would be understandable if the results of this analysis are negative.

      We thank the reviewer for a very interesting point to address. We cannot exclude the possibility of an unobserved heritable factor acting through the brain, and tissue-specific MR analyses would be one possible way to investigate this possibility. However, we agree with the reviewer that partitioning BMI variants into different tissues is not currently feasible as the causal tissues and cell types of the GWAS variants are not known. Nevertheless, we have now implemented a new analysis where we tried to stratify genetic variants into “brain-enriched” and “adipose tissue-enriched” groups, using a simple method based on the genetic variants’ effect sizes on BMI and body fat percentage.

      Our rationale for stratifying variants by comparing their effect sizes on BMI and body fat percentage is the following:

      BMI is calculated based on body weight and height (kg/m²) and it thus does not distinguish between body fat mass and body lean mass. Body fat percentage is calculated by dividing body fat mass by body weight (fat mass / weight × 100%) and it thus distinguishes body fat mass from body lean mass. Thus, higher BMI may reflect both increased fat mass and increased lean mass, whereas higher body fat percentage reflects that fat mass has increased more than lean mass.

      If a genetic variant influences BMI through the CNS control of energy balance, its effect on body fat mass and body lean mass would be expected to follow the usual correlation between the traits in the population, where higher fat mass is strongly correlated with higher lean mass. In such a scenario, the variant would show a larger standardized effect size on BMI than on body fat percentage. If a genetic variant more specifically affects adipose tissue, the variant would be expected to have a more specific effect on fat mass and less effect on lean mass. In such a scenario, the variant would show a larger standardized effect size on body fat percentage than on BMI.

      We therefore stratified BMI variants into brain-specific and adipose tissue-specific variants by comparing their standardized effect sizes on BMI and body fat percentage. Of the 12,790 variants included in the BMI-sedentary time CAUSE analysis, 12,266 had stronger effects on BMI than on body fat percentage and were thus classified as “brain-specific”. The remaining 524 variants had stronger effects on body fat percentage than on BMI (“adipose tissue-specific”). To assess whether the stratification of the variants led to biologically meaningful groups, we performed DEPICT tissue-enrichment analyses. The analyses showed that the genes expressed near the “brain-specific” variants were enriched in the CNS (figure below, panel A), whereas the genes expressed near the “adipose tissue-specific” variants did not reach significant enrichment in any tissue but showed the strongest evidence of being linked to adipocytes and adipose tissue (figure below, panel B).
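      The stratification rule described here can be sketched in a few lines. This is an illustrative reading, not the authors' pipeline: it assumes effect sizes are already standardized (SD units) and compares absolute values, and it assigns exact ties to the adipose group, a choice the response does not specify. Variant IDs and effect sizes are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    rsid: str
    std_effect_bmi: float      # standardized effect on BMI (SD units)
    std_effect_bodyfat: float  # standardized effect on body fat percentage (SD units)


def stratify_by_effect_size(variants):
    """Split variants into 'brain-enriched' (larger |effect| on BMI) and
    'adipose-enriched' (larger |effect| on body fat %) groups.

    Ties go to the adipose group here, which is an arbitrary choice.
    """
    brain, adipose = [], []
    for v in variants:
        if abs(v.std_effect_bmi) > abs(v.std_effect_bodyfat):
            brain.append(v.rsid)
        else:
            adipose.append(v.rsid)
    return brain, adipose


# Hypothetical variants and effect sizes, for illustration only.
example = [
    Variant("var_a", 0.040, 0.020),
    Variant("var_b", 0.015, 0.030),
    Variant("var_c", -0.025, -0.010),
]
print(stratify_by_effect_size(example))  # (['var_a', 'var_c'], ['var_b'])
```

      Each resulting list would then serve as a separate instrument set for the CAUSE re-analysis.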

      Figure legend: DEPICT cell, tissue and system enrichment bar plots for BMI-sedentary time analysis.

      Having established that the two groups of genetic variants likely represent tissue-specific groups, we re-estimated the causal relationship between BMI and sedentary time using CAUSE, separately for the two groups of variants. We found that the 12,266 “brain-specific” genetic variants showed a significant causal effect on sedentary time (P=0.003), but the effect was attenuated compared to the CAUSE analysis in which all 12,790 variants (i.e. also including the 524 “adipose tissue-specific” variants) were included (P=6.3×10⁻⁴). The statistical power was much more limited for the “adipose tissue-specific” variants, and we did not find a statistically significant causal relationship between BMI and sedentary time using the 524 “adipose tissue-specific” variants only (P=0.19). However, the direction of the effect suggested the possibility of a causal effect if a stronger genetic instrument were available. Taken together, our analyses suggest that both brain-enriched and adipose tissue-enriched genetic variants are likely to show a causal relationship between BMI and sedentary time, which would make it unlikely that this causal relationship is driven by an unobserved heritable factor.

      Minor comments

      The term "causally associated" is confusing, e.g. in l32. If it's causal, then use the term "causal".

      We have now changed the term “causally associated” to “causal” throughout the manuscript.

      Reviewer #3 (Public Review):

      Given previous reports of an observational relationship between physical inactivity and obesity, Carrasquilla and colleagues aimed to investigate the causal relationship between these traits and establish the direction of effect using Mendelian Randomization. In doing so, the authors report strong evidence of a bidirectional causal relationship between sedentary time and BMI, where genetic liability for longer sedentary time increases BMI, and genetic liability for higher BMI causally increases sedentary time. The authors also give evidence of higher moderate and vigorous physical activity causally reducing BMI. However, they do note that in the reverse direction there was evidence of horizontal pleiotropy, where higher BMI causally influences lower levels of physical activity through alternative pathways.

      The authors have used a number of methods to investigate and address potential limiting factors of the study. A major strength of the study is the use of the CAUSE method. This allowed the authors to investigate all exposures of interest, in spite of a low number of suitable genetic instruments (associated SNPs with P < 5×10⁻⁸) being available, which may not have been possible with the more conventional MR methods alone. The authors were also able to overcome sample overlap with this method, and hence obtain strong causal estimates for the study. The authors have compared causal estimates obtained from other MR methods, including IVW, MR Egger, and the weighted median and weighted mode methods. In doing so, they were able to demonstrate consistent directions of effects for most causal estimates when comparing them with those obtained from the CAUSE method. This helps to increase confidence in the results obtained and supports the conclusions made. This study is limited by the fact that the findings are not generalizable across different age groups or populations, although the authors do state that similar results have been found in childhood studies. As the authors also note, due to the nature of the BMI genetic instruments used, the findings of this study can only inform on the lifetime impact of higher BMI, and not the effect of a short-term intervention.

      The findings of this study will be of interest to those in the field of public health, and support current guidelines for the management of obesity.

      We thank the Reviewer for the valuable feedback and insights. We agree that the lack of generalizability of the findings across age groups and populations is an important limitation. We have now mentioned this in lines 341-342 of the manuscript:

      “The present study is also limited by the fact that the findings are not generalizable across different age groups or populations.”

    1. Author Response

      Reviewer #1 (Public Review):

      As far as I can tell, the inputs to the model are raw diffusion data plus a couple of maps extracted from T2 and MT data. While this is fine for the kind of models used here, it means that the trained networks will not generalise to other diffusion protocols (e.g. with different bvecs). This greatly reduces the usefulness of this model and hinders transfer to, e.g., human data. Why not use summary measures from the data as inputs? There are a number of rotationally invariant summary measures that one can extract. I suspect that the first layers of the network may be performing operations such as averaging that are akin to calculating summary measures, so the authors should consider doing that before feeding the data to the network.

      We agree with the reviewer that using summary measures will make the tool less dependent on particular imaging protocols and more translatable than using raw data as inputs. We have experimented with using a set of five summary measures (T2, magnetization transfer ratio (MTR), mean diffusivity, mean kurtosis, and fractional anisotropy) as inputs. The predictions based on these summary measures, although less accurate than those based on raw data in terms of RMSE and SSIM (Figure 2A), still outperformed polynomial fitting up to 2nd order. The result, while promising, also highlights the need for a more comprehensive collection of summary measures that matches the information available in the raw data. Further experiments with existing or new summary measures may lead to improved performance.
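      Several of these summary measures have closed-form definitions. The sketch below computes mean diffusivity and fractional anisotropy from the three diffusion tensor eigenvalues (both are rotationally invariant) and the magnetization transfer ratio from the unsaturated and saturated signals. It assumes the tensor has already been fitted; mean kurtosis requires a kurtosis-tensor fit and T2 a relaxometry fit, so neither is shown, and the eigenvalues in the example are hypothetical.

```python
import math


def mean_diffusivity(l1: float, l2: float, l3: float) -> float:
    """Mean of the three diffusion tensor eigenvalues."""
    return (l1 + l2 + l3) / 3.0


def fractional_anisotropy(l1: float, l2: float, l3: float) -> float:
    """FA = sqrt(3/2) * ||lambda - MD|| / ||lambda||, bounded in [0, 1]."""
    md = mean_diffusivity(l1, l2, l3)
    num = (l1 - md) ** 2 + (l2 - md) ** 2 + (l3 - md) ** 2
    den = l1 ** 2 + l2 ** 2 + l3 ** 2
    return 0.0 if den == 0 else math.sqrt(1.5 * num / den)


def mt_ratio(m0: float, m_sat: float) -> float:
    """Magnetization transfer ratio (M0 - Msat) / M0, as a fraction."""
    if m0 == 0:
        raise ValueError("m0 must be non-zero")
    return (m0 - m_sat) / m0


# Hypothetical white-matter-like eigenvalues (um^2/ms) and MT signals.
evals = (1.7, 0.3, 0.3)
print(mean_diffusivity(*evals), fractional_anisotropy(*evals), mt_ratio(100.0, 60.0))
```

      Because these quantities do not depend on gradient directions, a network trained on them is not tied to one b-vector scheme, which is the reviewer's point.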

      The noise sensitivity analysis is misleading. The authors add noise to each channel and examine the output to find which inputs are important. They find that T2/MT are more important for the prediction of the AF data, but the majority of the channels are diffusion data, where there is a lot of redundant information across channels. So it is not surprising that these channels are more robust to noise. In general, the authors make the point that they not only predict histology but can also interpret their model, but I am not sure what to make of either the t-SNE plots or the rose plots. I am not sure that these plots help with understanding the model and the contribution of the different modalities to the predictions.

      We agree that there is redundant information across channels, especially among diffusion MRI data. In the revised manuscript, we focused on using the information derived from noise-perturbation experiments to rank the inputs in order to accelerate image acquisition instead of interpreting the model. We removed the figure showing t-SNE plots with noisy inputs because it does not provide additional information.
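      The ranking step can be sketched generically: perturb one input channel at a time with Gaussian noise and measure how much the prediction changes relative to the unperturbed output. This is a hedged illustration with a hypothetical toy model, not the MRH network or the authors' code. As the reviewer notes, a channel whose information is duplicated elsewhere can appear unimportant under this test, so the ranking reflects sensitivity rather than unique information content.

```python
import math
import random


def rmse(a, b):
    """Root-mean-square difference between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def rank_inputs_by_noise(model, samples, n_channels, sigma=0.1, seed=0):
    """Rank input channels by how much Gaussian noise on each one changes the output.

    model: callable mapping a list of channel values to a scalar prediction.
    Returns channel indices ordered from most to least sensitive.
    """
    rng = random.Random(seed)
    baseline = [model(s) for s in samples]
    scores = []
    for c in range(n_channels):
        perturbed = []
        for s in samples:
            noisy = list(s)
            noisy[c] += rng.gauss(0.0, sigma)  # perturb only channel c
            perturbed.append(model(noisy))
        scores.append((rmse(baseline, perturbed), c))
    return [c for _, c in sorted(scores, reverse=True)]


def toy_model(x):
    """Hypothetical model: channel 0 matters most, channel 2 not at all."""
    return 3.0 * x[0] + 0.1 * x[1] + 0.0 * x[2]


data_rng = random.Random(1)
samples = [[data_rng.random() for _ in range(3)] for _ in range(50)]
print(rank_inputs_by_noise(toy_model, samples, n_channels=3))  # [0, 1, 2]
```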

      Is deep learning really required here? The authors are using a super deep network, mostly doing combinations of modalities. Is the mapping really highly nonlinear? How does it compare with a linear or close to linear mapping (e.g. regression of output onto input and quadratic combinations of input)? How many neurons are actually doing any work and how many are silent (this can happen a lot with ReLU nonlinearities)? In general, not much is done to convince the reader that such a complex model is needed and that a much simpler regression approach cannot do the job.

      The deep learning network used in the study is indeed quite deep, and there are two main reasons for choosing it over simpler approaches.

      The primary reason to pick the deep learning approach is to accommodate complex relationships between MRI and histology signals. In the revised Figure 2A-B, we have demonstrated that the network can produce better predictions of tissue auto-fluorescence (AF) signals than 1st and 2nd order polynomial fitting. For example, the predicted AF image based on 5 input MR parameters bore more visual resemblance to the reference AF image than images generated by 1st and 2nd order polynomial fitting, which was confirmed by RMSE and SSIM values. The training curves shown in Fig. R1 below demonstrate that, for learning the relationship between MRI and AF signals, at least 10 residual blocks (~24 layers) are needed. Later, when learning the relationship between MRI and Nissl signals, 30 residual blocks (~64 layers) were needed, as the relationship between MRI and Nissl signals appears less straightforward than the relationship between MRI and AF/MBP/NF signals, which have a strong myelin component. In the revised manuscript, we have clarified this point, and the provided toolbox allows users to select the number of residual blocks based on their applications.
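      The two image-quality metrics used for these comparisons can be illustrated as follows. This sketch computes RMSE and a single-window (global) SSIM over flattened images with the usual constants K1 = 0.01 and K2 = 0.03. Standard SSIM implementations average the index over local, typically Gaussian-weighted, windows, so this global version is a simplification and will not match library values exactly. The example images are hypothetical.

```python
import math


def rmse(pred, ref):
    """Root-mean-square error between two flattened images."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("images must be non-empty and the same size")
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(pred))


def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Single-window SSIM over the whole image (standard SSIM averages local windows)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))


# Hypothetical 2x2 images flattened to lists, intensities scaled to [0, 1].
reference = [0.1, 0.4, 0.6, 0.9]
predicted = [0.15, 0.35, 0.65, 0.85]
print(rmse(predicted, reference), global_ssim(predicted, reference))
```

      Lower RMSE and SSIM closer to 1 both indicate a closer match to the reference histology image.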

      Fig. R1: Training curves of MRH-AF with number of residual blocks ranging from 1 to 30 showing decreasing RMSEs with increasing iterations. The curves in the red rectangular box on the right are enlarged to compare the RMSE values. The training curves of 10 and 30 residual blocks are comparable, both converged with lower RMSE values than the results with 1 and 5 residual blocks.

      In addition, the deep learning approach can better accommodate residual mismatches between co-registered histology and MRI than polynomial fitting. Even after careful co-registration, residual mismatches between histology and MRI data can still be found, which pose a challenge for polynomial fittings. We have tested the effect of mismatch by introducing voxel displacements to perfectly co-registered diffusion MRI datasets and demonstrated that the deep learning network used in this study can handle the mismatches (Figure 1 – figure supplement 1).

      Relatedly, the comparison between the MRH approach and some standard measures such as FA, MD, and MTR is unfair. Their network is trained to match the histology data, but the standard measures are not. How does the MRH approach compare to e.g. simply combining FA/MD/MTR to map to histology? This to me would be a more relevant comparison.

      This is a good idea. We have added maps generated by linear fitting of five MR measures (T2, MTR, FA, MD, and MK) to MBP for a proper comparison. Please see the revised Figure 3A-B. The MRH approach provided better prediction than linear fitting of the five MR measures, as shown by the ROC curves in Figure 3C.
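      The linear-fitting baseline amounts to ordinary least squares from a vector of MR measures per voxel to the histology value. The sketch below solves the normal equations with Gauss-Jordan elimination in pure Python. It is a hypothetical illustration rather than the authors' fitting code; in practice the five inputs would be T2, MTR, FA, MD and MK, but the usage example uses two synthetic features for brevity.

```python
def fit_linear(features, target):
    """Ordinary least squares with an intercept, solved via the normal equations.

    features: list of rows (one row of MR measures per voxel); target: histology values.
    Returns [intercept, w1, ..., wk].
    """
    rows = [[1.0] + [float(v) for v in r] for r in features]
    k = len(rows[0])
    # Augmented matrix [A^T A | A^T y].
    m = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, target))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        scale = m[col][col]
        m[col] = [v / scale for v in m[col]]
        for r in range(k):
            if r != col:
                factor = m[r][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return [m[i][k] for i in range(k)]


def predict(coef, row):
    """Apply fitted coefficients to one voxel's MR measures."""
    return coef[0] + sum(w * v for w, v in zip(coef[1:], row))


# Synthetic data generated from target = 1 + 2*f1 - f2.
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1, 3, 0, 2, 4]
coef = fit_linear(X, y)
print(coef, predict(coef, [3, 2]))
```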

      • Not clear if there are 64 layers or 64 residual blocks. Also, is the convolution only doing something across channels? i.e. do we get the same performance by simply averaging the 3x3 voxels?

      We have revised the paragraph on the network architecture in the Figure 1 caption as well as in the Methods section to clarify this point. We used 30 residual blocks, each consisting of 2 layers. There are an additional 4 layers at the input and output ends, so we had 64 layers in total.
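      The layer count follows directly from the block structure described above (2 layers per residual block plus 4 input/output layers). The helper below is a hypothetical illustration of that arithmetic, and reproduces both the ~24 layers for 10 blocks and the 64 layers for 30 blocks quoted in the response.

```python
def total_layers(n_blocks: int, layers_per_block: int = 2, extra_layers: int = 4) -> int:
    """Layer count: residual blocks * layers per block + input/output layers."""
    return n_blocks * layers_per_block + extra_layers


for blocks in (1, 5, 10, 30):
    print(blocks, "residual blocks ->", total_layers(blocks), "layers")
```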

      The convolution mostly works across channels, which is what we intended as we are interested in finding the local relationship between multiple MRI contrasts and histology. With inputs from modified 3x3 patches, in which all voxels were assigned the same values as the center voxel, the predictions of MRH-AF did not show apparent loss in sensitivity and specificity, and the voxel-wise correlation with reference AF data remained strong (See Fig. R2 below). We think this is an important piece of information and added it as Figure 1 – figure supplement 3. Averaging the 3x3 voxels in each patch produced similar results.
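      The two patch modifications can be sketched as simple transformations of a 3x3 multichannel patch: one copies the center voxel's channel values to all nine positions, and the other replaces every voxel with the per-channel mean of the nine voxels. Representing a patch as a nested list of per-voxel channel vectors is an assumption made here for illustration; the actual data layout in the MRH toolbox may differ.

```python
def center_replicated(patch):
    """Return a 3x3 patch in which every voxel copies the center voxel's channel values.

    patch: 3x3 nested list; patch[i][j] is a list of per-channel MR values.
    """
    center = patch[1][1]
    return [[list(center) for _ in range(3)] for _ in range(3)]


def channel_averaged(patch):
    """Return a 3x3 patch in which every voxel holds the per-channel mean of the 9 voxels."""
    voxels = [v for row in patch for v in row]
    n_channels = len(voxels[0])
    mean = [sum(v[c] for v in voxels) / len(voxels) for c in range(n_channels)]
    return [[list(mean) for _ in range(3)] for _ in range(3)]


# Hypothetical 2-channel patch: channel 0 varies across voxels, channel 1 is constant.
patch = [[[3 * i + j, 10.0] for j in range(3)] for i in range(3)]
print(center_replicated(patch)[0][0], channel_averaged(patch)[0][0])
```

      If predictions barely change when spatial variation inside the patch is removed in either way, the convolution is effectively acting across channels, which is the behavior described in the response.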

      Fig. R2: Evaluation of MRH-AF results generated using modified 3x3 patches with 9 voxels assigned the same MR signals as the center voxel as inputs. A: Visual inspection showed no apparent differences between results generated using original patches and those using modified patches. B: ROC analysis showed a slight decrease in AUC for the MRH-AF results generated using modified patches (dashed purple curve) compared to the original (solid black curve). C: Correlation between MRH-AF using modified patches as inputs and reference AF signals (purple open circles) was slightly lower than the original (black open circles).

      The result in the shiverer mouse is most impressive. Were the shiverer mice data included in the training? If not, this should be mentioned/highlighted as it is very cool.

      Data from shiverer mice and littermate controls were not included in the training. We have clarified this point in the manuscript.

    1. Author Response

      Reviewer #1 (Public Review):

      This study used GWAS data and RNAseq data from TCGA to show a link between telomere length and lung cancer. The authors identified novel susceptibility loci that are associated with lung adenocarcinoma risk. They showed that longer telomeres were associated with being a female nonsmoker and with early-stage cancer with a signature of cell proliferation, genome stability, and telomerase activity.

      Major comments:

      1) It is not clear how the signatures captured by PC2 are specific to lung adenocarcinoma compared to other lung cancer subtypes. In other words, why is the association with long telomeres specific to lung adenocarcinoma?

      We thank the reviewer for raising this point (similarly mentioned by reviewer #2). Indeed, it is unclear why genetically predicted LTL appears more relevant to lung adenocarcinoma. We have used a LASSO approach to select important features of PC2 in lung adenocarcinoma and inferred PC2 in lung squamous cell carcinoma tumours to better explore the differences between histological subtypes. The new results are presented in Figure 5 and described in the Methods and Results sections. In addition, we have expanded upon this point in the discussion with the following paragraph (page 11, lines 229-248):

      ‘An explanation for why long LTL was associated with increased risk of lung cancer might be that individuals with longer telomeres have lower rates of telomere attrition compared to individuals with shorter telomeres. Given a very large population of histologically normal cells, even a very small difference in telomere attrition would change the probability that a given cell is able to escape the telomere-mediated cell death pathways (24). Such inter-individual differences could suffice to explain the modest lung cancer risk observed in our MR analyses. However, it is not clear why longer TL would be more relevant to lung adenocarcinoma compared to other lung cancer subtypes. A clue may come from our observation that longer LTL is related to genomically stable lung tumours (such as lung adenocarcinomas in never smokers and tumours with lower proliferation rates) but not to genomically unstable lung tumours (such as heavy smoking-related, highly proliferating lung squamous cell carcinomas). One possible hypothesis is that histologically normal cells exposed to highly genotoxic compounds, such as those in tobacco smoke, might require an intrinsic activation of telomere length maintenance at early steps of carcinogenesis that would allow them to survive, and therefore genetic differences in telomere length are less relevant in these cells. By contrast, in more genomically stable lung tumours, where the TL attrition rate is more modest, differences in TL may be more relevant, potentially explaining the heterogeneity in genetic effects between lung tumours (Figure 2). Alternatively, the cell of origin may also differ: lung adenocarcinoma is postulated to derive mostly from alveolar type 2 cells, whereas squamous cell carcinoma derives from bronchiolar epithelial cells (19), possibly suggesting that LTL might be more relevant to the former.’

      2) The manuscript lacks specific comparisons of gene expression changes across lung cancer subtypes for identified genes such as telomerase, since all the data are presented as associations embedded within PCs.

      The genes associated with telomere maintenance, such as TERT and TERC, are expressed at very low levels in these tumours (Barthel et al NG 2017). In this context, no sample has more than 5 normalised read counts for TERT by RNA-sequencing within the TCGA lung cohorts (TCGA-LUSC, TCGA-LUAD). As such, we have not explored differences in individual telomere-related genes. Nevertheless, we have explored an inferred telomerase activity gene signature developed by Barthel et al in the context of lung adenocarcinoma tumours. We have added a note in the Results section to inform the reader why we did not directly test TERT/TERC expression (page 9, lines 184-187).

      3) It is not clear how novel the findings are, given that most of these observations have been made previously, i.e., the genetic component of the association between telomere length and cancer.

      Others, including ourselves, have studied TL and lung cancer. We have built on that work using the most up-to-date TL genetic instrument and the largest lung cancer study available. In addition, we provided insights into the possible mechanisms by which telomere length might affect lung adenocarcinoma development. Using colocalisation analyses, we reported novel shared genetic loci between telomere length and lung adenocarcinoma (MPHOSPH6, PRPF6, and POLI), genes/loci that have not previously been linked to lung adenocarcinoma susceptibility. For the MPHOSPH6 locus, we showed that the risk allele of rs2303262 (a missense variant annotated to the MPHOSPH6 gene) colocalised with increased lung adenocarcinoma risk, lower lung function (FEV1 and FVC), and increased MPHOSPH6 gene expression in lung, as highlighted in the discussion section of the revised manuscript.

      In addition, we have used a PRS analysis to identify a gene expression component associated with genetically predicted telomere length in lung adenocarcinoma but not in the squamous cell carcinoma subtype. The aspect of this gene expression component associated with longer telomere length is also associated with molecular characteristics related to genome stability (lower accumulation of DNA damage and copy number alterations, and lower proliferation rates), female sex, early-stage tumours, and never smoking, which is an interesting but not completely understood lung cancer stratum. As far as we are aware, this is the first report of an association between a PRS related to an etiological factor, such as telomere length, and a particular expression component in the tumour.
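      The PRS analysis mentioned above starts from a per-individual score. A common construction, assumed here rather than taken from the manuscript, is the weighted sum of effect-allele dosages, with weights equal to the per-allele effects from the exposure GWAS (here, LTL). The function name, variant names and values below are hypothetical placeholders.

```python
from typing import Dict


def polygenic_risk_score(dosages: Dict[str, float], weights: Dict[str, float]) -> float:
    """Sum of per-allele weights times effect-allele dosages (0, 1 or 2).

    Only variants present in both dictionaries contribute, so missing
    genotypes are silently skipped in this simplified sketch.
    """
    shared = dosages.keys() & weights.keys()
    return sum(weights[snp] * dosages[snp] for snp in shared)


# Hypothetical per-allele LTL effect sizes and one individual's dosages.
ltl_weights = {"snp_a": 0.10, "snp_b": -0.05, "snp_c": 0.20}
person_dosages = {"snp_a": 2, "snp_b": 1, "snp_c": 0}
print(round(polygenic_risk_score(person_dosages, ltl_weights), 3))  # 0.15
```

      In a tumour-level analysis the score would be computed for every participant and then related to the expression components; that downstream step is not shown.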

      We have adjusted the discussion section of the revised manuscript to further highlight the novel aspects of our work.

      Reviewer #2 (Public Review):

      The manuscript of Penha et al performs genetic correlation, Mendelian randomization (MR), and colocalization analyses to determine the role of genetically determined leukocyte telomere length (LTL) in susceptibility to lung cancer. They develop an instrument from the most recent published association study of LTL (Codd et al), which here is based on n=144 genetic variants, and use the largest association study of lung cancer (including ~29K cases and ~56K controls). While they observed no significant genetic correlation between LTL and lung cancer, in MR they observed a strong association that persisted after accounting for smoking status. They performed colocalization to identify a subset of loci where LTL and lung cancer risk signals coincided, mainly around TERT but also at other loci. They also utilized RNA-Seq data from TCGA lung adenocarcinoma, noting that a particular gene expression profile (identified by a PC analysis) seemed to correlate with LTL. This expression component was associated with some additional patient characteristics, genome stability, and telomerase activity.

      In general, the MR analysis was performed reasonably (with some suggestions and comments below), but most of it has been performed before, and the major observations were made in previous work. That said, the instrument is better powered and some sub-analyses are performed, which adds further robustness to this observation. While perhaps beyond the scope here, the mechanism by which longer LTL is associated with (lung) cancer seems like one of the key observations and is mechanistically interesting, but nothing is added to the discussion to clarify or refute the previous speculations mentioned here (or in other work they cite).

      Some broad comments:

      1) The observation that lung adenocarcinoma carries the lion's share of risk from LTL (relative to other cancer subtypes) could be interesting but is not particularly highlighted. This could potentially be explored or discussed in more detail. Are there specific aspects of the biology of the substrata that could explain this (or lead to testable hypotheses)?

      We thank the reviewer for these comments. A similar point was raised by reviewer #1. Please see our response above, as well as the additional analysis described in Figure 5 that considers the differences by histological subtype.

      2) Given that LTL is genetically correlated (and MR evidence suggests also possibly causally related in some cases) with a range of traits (e.g., adiposity) that may also associate with lung cancer, a larger genetic correlation analysis might be in order, followed by a larger set of multivariable MR (MVMR) analyses beyond smoking as a risk factor. Basically, can the observed relationship be explained by another trait (beyond smoking)? For example, there is previous MR literature on adiposity measures (BMI, WHR, or WHRadjBMI) and telomere length, plus literature on adiposity and lung cancer, and furthermore on smoking and BMI. A more comprehensive set of MVMR analyses within this space would elevate the significance and interpretation compared to previous literature.

      Indeed, there are important effects related to BMI and lung cancer (Zhou et al., 2021. Doi:10.1002/ijc.33292; Mariosa et al., 2022. Doi: 10.1093/jnci/djac061). We have tested the potential influence on our findings using MVMR, modelling LTL and BMI with a BMI genetic instrument of 755 SNPs obtained from UKBB (feature code: ukb-b-19953). This multivariable approach did not result in any meaningful changes in the associations between LTL and lung cancer risk.
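      MVMR of this kind is commonly implemented as an inverse-variance-weighted regression of the SNP-outcome effects on the SNP-exposure effects, without an intercept. The sketch below shows that estimator for two exposures (e.g., LTL and BMI) from summary statistics. It is an illustration of the general method under that assumption, not the authors' code; the function name and all inputs are hypothetical.

```python
from typing import Sequence, Tuple


def mvmr_ivw(bx1: Sequence[float], bx2: Sequence[float],
             by: Sequence[float], se_by: Sequence[float]) -> Tuple[float, float]:
    """Two-exposure IVW multivariable MR via weighted least squares without intercept.

    bx1, bx2: SNP effects on exposure 1 (e.g. LTL) and exposure 2 (e.g. BMI).
    by, se_by: SNP effects on the outcome and their standard errors.
    Returns the direct effect estimates (theta1, theta2).
    """
    w = [1.0 / s ** 2 for s in se_by]  # inverse-variance weights
    # Elements of the weighted normal equations (X'WX) theta = X'Wy.
    s11 = sum(wi * a * a for wi, a in zip(w, bx1))
    s22 = sum(wi * b * b for wi, b in zip(w, bx2))
    s12 = sum(wi * a * b for wi, a, b in zip(w, bx1, bx2))
    t1 = sum(wi * a * y for wi, a, y in zip(w, bx1, by))
    t2 = sum(wi * b * y for wi, b, y in zip(w, bx2, by))
    det = s11 * s22 - s12 ** 2
    if abs(det) < 1e-12:
        raise ValueError("exposure effects are collinear; estimates not identifiable")
    # Closed-form inverse of the 2x2 system.
    theta1 = (s22 * t1 - s12 * t2) / det
    theta2 = (s11 * t2 - s12 * t1) / det
    return theta1, theta2
```

      If the LTL estimate (theta1) barely moves when BMI is added as a second exposure, the LTL association is unlikely to be explained by a shared BMI pathway, which is the logic of the adjustment described above.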

      3) In the initial LTL paper, the authors constructed an IV for MR analyses, which appears to differ from what the authors selected here. For example, Codd et al. proposed an n=130 SNP instrument from their n=193 sentinel variants, after filtering for LD (n=193 >>> n=147) and then for multi-trait association (n=147 >> n=130). I don't think this will fundamentally change the authors' result, but the authors may want to confirm robustness to slightly different instrument selection procedures or explain why they favor their approach over the previous one.

      We appreciate the reviewer’s suggestion. Our study was designed within a Mendelian randomization framework, and we chose to be conservative in the construction of our instrumental variable (IV). We therefore applied more stringent filters to the LTL variants than Codd et al. We applied a wider LD window (10 Mb vs. 1 Mb) centered on the LTL variants that were significant at the genome-wide level (p<5e-08), and we restricted our analyses to biallelic common SNPs (MAF>1%) in low LD (r2<0.01 in the European population from 1000 Genomes). Nevertheless, the LTL genetic instrument based on our study (144 LTL variants) is highly correlated with the PRS based on the 130 variants described by Codd et al. (correlation estimate=0.78, p<2.2e-16). MR analyses based on the 130-variant LTL instrument described by Codd et al showed similar results to our study.
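      Comparing the two instruments amounts to running the same MR estimator on two SNP sets. As a minimal sketch, assuming allele-harmonised summary statistics and the standard fixed-effect inverse-variance-weighted (IVW) estimator (the function name and inputs are hypothetical, and the authors' pipeline may instead use random-effects IVW plus additional sensitivity estimators), each instrument's estimate could be obtained as follows:

```python
import math
from typing import Sequence, Tuple


def ivw_estimate(bx: Sequence[float], by: Sequence[float],
                 se_by: Sequence[float]) -> Tuple[float, float]:
    """Fixed-effect inverse-variance-weighted (IVW) MR estimate and standard error.

    bx: SNP effects on the exposure (e.g. LTL); by, se_by: SNP effects on the
    outcome (e.g. lung cancer log-odds) and their standard errors. Alleles are
    assumed to be harmonised to the same effect allele beforehand.
    """
    if not (len(bx) == len(by) == len(se_by)) or len(bx) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    w = [1.0 / s ** 2 for s in se_by]  # inverse-variance weights
    num = sum(wi * x * y for wi, x, y in zip(w, bx, by))
    den = sum(wi * x * x for wi, x in zip(w, bx))
    return num / den, math.sqrt(1.0 / den)
```

      Calling `ivw_estimate` once with the 144-variant summary statistics and once with the 130-variant set gives two estimates that can be compared directly.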

      4) Colocalization analysis suggests that a /subset/ of LTL signals map onto lung cancer signals. Does this mean that the MR relationships are driven entirely by this small subset, or is there (polygenic) evidence from other loci? Rather than doing a "leave one out" analysis, the authors could stratify their instrument into "coloc +ve / coloc -ve" and redo the MR analyses.

      Mainly, the goal here is to determine whether the subset of signals at the top (it looks like n=14, the bump of non-trivial PP4 > 0.6, say), which map predominantly to TERT, TERC, and OBFC1, explains the observed effect. That is, whether it reflects biology around these specific mechanisms, or LTL more generally (polygenicity) exemplified by extreme examples (TERT, etc.). I appreciate that statistical power is a consideration to keep in mind in interpretation.

      We appreciate the reviewer’s comment and, indeed, we considered this idea. However, the analytical approach used the lung cancer GWAS to identify variants that colocalise. To test the hypothesis that a subset of colocalised variants drives all the MR associations, we would need an independent lung cancer case-control study to act as an out-of-sample validation set. This is not available to us at this point. Nevertheless, we slightly re-worded the discussion to highlight that the colocalised loci tend to be near genes related to telomere length biology, and we are also exploring the colocalisation approach to select variants for PRS analysis elsewhere.

    1. Author Response

      Reviewer #1 (Public Review):

      The authors examine the role of the K700E mutation in the Sf3b1 splicing factor in PDAC and report that this Sf3b1 mutation promotes PDAC by decreasing sensitivity to TGF-β, resulting in decreased EMT and decreased apoptosis. They propose that the Sf3b1 K700E mutant causes decreased expression of Map3k7, a known mediator of TGF-β signaling that is also known to be alternatively spliced in other systems by the Sf3b1 K700E mutation. The role of splicing defects in cancer is relatively understudied and could identify novel targets for therapeutic intervention, so this work is of potential significance. However, the data are over-interpreted in many instances and it is not clear the authors can make the claims they do based on the data shown. In particular, the data showing that decreased Map3k7 underlies the effects of the Sf3b1K700E mutant are very weak. Does over-expression of Map3k7 promote the EMT signature and induce apoptosis? Do the Map3k7 expressing organoids form tumors more effectively when transplanted into mice? Also, the novelty of the work is a concern, since aberrant Map3k7 splicing due to SF3B1 mutation was seen previously in other systems. The authors also do not address the apparent conundrum of the Sf3b1 K700E mutation promoting tumorigenesis despite there being less EMT, which is also required for progression to metastasis in PDAC.

      Major Concerns.

      1) The analysis of the effect of Sf3b1K700E expression on normal pancreas and on PanINs in KC mice and PDAC in KPC mice is superficial and could be enhanced by staining for amylase, cytokeratin-19 and insulin. In particular, the data quantified in figure 1L should be accompanied by staining for CK19, Mucin5AC or some other marker of ductal transformation. Also, are any effects seen at older ages in normal mice?

      We performed staining of normal and cancerous mouse pancreata using CK19, MUC5AC and b-amylase antibodies. In line with our hypothesis that Sf3b1K700E mainly plays a role in early stages of PDAC formation, we observed significant differences in CK19 (increase), MUC5AC (increase) and b-amylase (decrease) expression in early-stage KPC-Sf3b1K700E vs. KPC tumors (Fig. 1G-J), but not in late-stage tumors (see Figure 1-figure supplement 1F-I). In addition, no differences were observed in normal mice. We added these data to the revised manuscript (see Figure 1-figure supplement 1D, E).

      2) The invasion assays used are limited and should be complemented by more routine quantification of cell migration and invasion, such as scratch assays, Boyden chamber assays and IncuCyte-based quantification. As it stands, the image in Figure 3B is difficult to interpret since it is very poorly described in the figure legend. Additional evidence is needed to support the claims made by the authors.

      During the revision, we performed wound healing/scratch assays using PANC-1 cells with inducible SF3B1 WT/K700E overexpression. We observed a significant difference in migratory capacity between SF3B1 WT- and SF3B1 K700E-overexpressing cells stimulated with TGF-β. We added these data to the revised manuscript (Fig. 2I, J). We also describe the abovementioned Figure 3B in more detail (revised manuscript Fig. 2G, H; lines 759-767).

      3) The authors should show the actual CC3 staining quantified in Suppl. Figure 2G.

      We added a representative image of CC3 staining (see Figure 3-figure supplement 1A) for the quantified data (see Figure 3-figure supplement 1B in the revised manuscript).

      4) The graph in Figure 3L should show WT and Sf3b1K700E expressing organoid numbers both with and without TGF-β.

      Since organoids grown without TGF-β supplementation have to be split in a 1:3 ratio every 5 days, we could not follow the same passaging regimen as in experiments with TGF-β supplementation (split in a 1:2 ratio every 20 days, Fig. 3I). However, we assessed the number of organoids grown in control medium without TGF-β for 4 passages (20 days) at a 1:3 split ratio and observed no difference in organoid number between WT and Sf3b1K700E expressing organoids (Author response image 1). In the revised manuscript we show with a highly quantitative read-out (CellTiterGlo) that Sf3b1K700E expressing organoids do not grow faster than Sf3b1 WT expressing organoids in the absence of TGF-β (see Figure 3-figure supplement 1E). Taken together, we can exclude the possibility that Sf3b1K700E organoids outgrow Sf3b1 WT organoids in TGF-β-supplemented medium merely because they have a general growth advantage.

      Author response image 1.

      Author response image 1. WT and Sf3b1K700E expressing organoids were cultured without TGF-β supplementation. Organoids were split in a 1:3 ratio every 5 days. Data points show organoid number before splitting, assessed for 4 passages.

      Reviewer #2 (Public Review):

      The manuscript has several areas of strength; it functionally explores a mutant that is detected in a portion of pancreatic cancers; it conducts mechanistic investigation and it uses human cell lines to validate the findings based on mouse models. Some areas for improvement are described below.

      1) TGF-β is known to act as a tumor suppressor early in carcinogenesis, and as a tumor promoter later. The authors should extend their analysis of mouse models to determine whether the effect of SF3B1K700E is specific to promoting initiation (e.g., more or earlier acinar-to-ductal metaplasia) or to faster progression of PanINs following their formation. Another way to address this could be acinar cultures, to determine whether an increased propensity to ADM exists.

      To further disentangle the effect of Sf3b1K700E on tumor progression, we analyzed our autochthonous model at an early and a late stage of tumor progression. Histological examination at 5 weeks revealed an increased propensity to ADM (see Figure 1-figure supplement 1J, K), increased PanIN formation (shown by MUC5AC and CK19 IF staining, Fig. 1G, I, J) and a concomitant decrease of acinar cells (shown by b-amylase staining) in KPC-Sf3b1K700E vs. KPC tumors (Fig. 1G, H). Analysis of tumors at 9 weeks of age showed no differences in CK19 staining or fibrosis. We added these data to the revised manuscript (see Figure 1-figure supplement 1F-I).

      2) Given that the effect of SF3B1K700E expression is more prominent in KC mice, rather than in KPC mice, the authors should explain the rationale for using the latter for RNA sequencing.

      In KC mice, pre-invasive PanIN lesions only infrequently progress to PDAC (spontaneous progression, see Gabriel et al., Pancreatology, 2020). Therefore, it would have been difficult to collect enough material for cell sorting and downstream RNA sequencing of tumor cells. The KPC mouse model develops PDAC with 100% penetrance, allowing the collection of sufficient material.

      3) Given that this mutation is found in about 3% of human pancreatic cancers, it would be interesting to know whether these tumors have any unique features, and specifically any characteristic that could be harnessed therapeutically.

      Unfortunately, the published datasets are too small for a meaningful differential gene expression analysis of SF3B1-WT vs. SF3B1-K700E PDAC tumors (due to the low occurrence of SF3B1-K700E PDAC). However, harnessing the K700E mutation therapeutically by increasing missplicing through splicing inhibitors has previously been suggested, and it was shown that SF3B1-K700E mutated cancer cells are more prone to apoptosis than SF3B1-WT cells when splicing is chemically targeted. We tested a similar approach in murine pre-cancerous organoids, demonstrating that Sf3b1-WT organoids show higher survival than Sf3b1K700E expressing organoids when treated with the splicing inhibitor Pladienolide B (Author response image 2). However, since this concept is not novel and not within the topic of our manuscript, we would prefer not to integrate these data into our manuscript.

      Author response image 2.

      Author response image 2. 33 nM of the splicing inhibitor Pladienolide B was added to the cell culture medium for 48 hours and the viability was assessed by normalizing organoid numbers to untreated control organoids. The line indicates WT and Sf3b1K700E organoids assessed in the same replicate.

      4) It would be interesting to know whether this mutation is mutually exclusive with other mutations affecting the response to TGF-β. Further, while the data might not be widely available, it would be interesting to know whether in human patients the mutation occurs in precursor lesions (PanIN might be difficult to assess, but IPMN might be doable) or at later stages.

      We performed a mutual exclusivity analysis of PDAC samples available at www.cbioportal.org, but did not find mutual exclusivity of SF3B1-K700E with genes of the TGF-β pathway. Of note, the value of the analysis is limited by the small sample size of SF3B1-K700E PDAC (n=7). Moreover, to our knowledge there is no public tissue biobank for PDAC that would allow us to assess the stage of SF3B1-K700E mutated PDAC tumors. Thus, unfortunately, we cannot histologically assess whether the mutations already occur in early stages of human tumor development.

      Author response table 1.

      Author response table 1: Mutual exclusivity analysis of public PDAC databases (ICGC, CPTAC, QCMG, TCGA, UTSW), including 910 patients. Mutation frequency is 25% for SMAD4, 5% for TGFBR2, 3% for SMAD2, 2.6% for TGFBR1, 1.4% for SMAD3, 0.7% for SF3B1-K700E, 0.7% for TGFBR3 and 0.4% for SMAD1. Analysis was performed on cbioportal.org.
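      cBioPortal's mutual exclusivity view reports, to our understanding, a log2 odds ratio and a one-sided Fisher's exact test p-value for each gene pair. The sketch below reproduces that kind of calculation for a single pair from a 2x2 table of patient counts. It is not cBioPortal's code; the function name is hypothetical, and the zero-cell correction is an illustrative assumption.

```python
import math
from typing import Tuple


def mutual_exclusivity(both: int, a_only: int, b_only: int, neither: int) -> Tuple[float, float]:
    """Log2 odds ratio and one-sided (lower-tail) Fisher's exact p-value for a gene pair.

    A small p-value indicates fewer co-altered samples than expected by chance,
    i.e. a tendency towards mutual exclusivity.
    """
    n = both + a_only + b_only + neither
    altered_a = both + a_only  # samples with an alteration in gene A
    altered_b = both + b_only  # samples with an alteration in gene B
    # Hypergeometric lower tail: P(co-altered count <= observed), with fixed margins.
    k_min = max(0, altered_a + altered_b - n)
    tail = sum(math.comb(altered_b, k) * math.comb(n - altered_b, altered_a - k)
               for k in range(k_min, both + 1))
    p_value = tail / math.comb(n, altered_a)
    # Illustrative Haldane-Anscombe correction (add 0.5 to every cell) if any cell is zero.
    cells = [both, a_only, b_only, neither]
    if 0 in cells:
        cells = [c + 0.5 for c in cells]
    log2_or = math.log2((cells[0] * cells[3]) / (cells[1] * cells[2]))
    return log2_or, p_value
```

      With SF3B1-K700E present in only 7 of 910 patients, the number of co-altered patients expected under independence is small for every partner gene (about 1.75 even for SMAD4 at 25%), which illustrates why such a test has little power here.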

      Reviewer #3 (Public Review):

      Alternative splicing as a result of mutations in different components of the splicing machinery has been associated with a variety of cancer types, including hematological malignancies, where this has been most extensively studied, but also solid tumors such as breast cancer and pancreatic ductal adenocarcinoma (PDAC). Here the authors analyze genome sequencing data from human PDAC samples and identify a recurring mutation in the SF3B1 subunit that substitutes glutamate for lysine at residue 700 (SF3B1K700E) in PDACs. This mutation has been identified, and its molecular role in disease progression has been studied in other diseases, but the mechanism by which it promotes disease progression in pancreatic cancer has not been as well characterized.

      To study how SF3B1K700E contributes to PDAC pathology, the authors generate a novel genetically modified mouse model with a pancreas-specific SF3B1K700E mutation and explore its oncogenicity and tumor-promoting potential. The authors find that SF3B1K700E is not oncogenic, but potentiates the oncogenic potential of the Kras and p53 (KP) driver mutations commonly found in PDAC tumors. The authors then proceed to characterize the molecular mechanisms that might drive this phenotype. By transcriptomic analysis, the authors find that KP-SF3B1K700E tumors have downregulation of epithelial-to-mesenchymal transition (EMT) genes compared to KP tumors. The cytokine TGFβ has previously been found to limit PDAC initiation and progression by causing lethal EMT in PDAC and PDAC precursor cells. Thus, the authors propose that SF3B1K700E inhibition of EMT blocks the tumor-suppressive activity of TGFβ and that this underpins the tumor-promoting role of the SF3B1K700E mutation in PDAC. Consistent with this, the SF3B1K700E mutation blocks TGFβ-induced toxicity in a variety of cell culture models of PDAC and PDAC precursors.

      Lastly, the authors seek to identify how altered splicing reduces EMT activity in PDAC cells. The authors identify genes that are consistently misspliced in both KP and human SF3B1K700E mutant cancer samples and find Map3k7 as one of 11 consistently misspliced genes. MAP3K7 has previously been identified as a positive regulator of EMT. Thus, the authors speculated that Map3k7 missplicing would lead to reduced MAP3K7 activity and a reduction in EMT, and that this underpins the TGFβ resistance of SF3B1K700E mutant PDAC cells. Consistent with this, the authors find that inhibition of MAP3K7 reduces TGFβ toxicity in SF3B1 WT cells and that overexpression of MAP3K7 in SF3B1K700E mutant PDAC cells induces TGFβ toxicity. Altogether, this suggests that Map3k7 activity is responsible for the altered EMT activity and TGFβ sensitivity in SF3B1K700E mutant PDAC.

      Altogether, the authors generate a valuable model to study the role of a recurring splicing mutation in PDAC and provide compelling evidence that this mutation accelerates disease. The authors then perform both (1) an open-ended investigation of how this mutation alters PDAC cell biology, in which they identify altered EMT activity, and (2) rigorous mechanistic studies showing that suppressed EMT provides PDAC cells with resistance to TGFβ, which has previously been shown to be tumor suppressive in PDAC, suggesting a possible mechanism by which the SF3B1K700E mutation is oncogenic in PDAC that future animal studies can confirm. This work generates valuable models and datasets to advance the understanding of how mutations in the splicing machinery can promote PDAC progression and suggests that alternative splicing of MAP3K7 is one possible mechanism by which altered splicing promotes PDAC progression in vivo.

      • One major concern about the manuscript is that the proposed mechanism by which the SF3B1K700E mutation accelerates PDAC progression (MAP3K7 inhibition -> EMT inhibition -> reduced TGF-β toxicity) is only tested in ex vivo culture models, and there are very limited, correlative data to suggest that this is the operative mechanism by which SF3B1K700E mutant tumors are accelerated. This is especially important because of recent findings that IFN-α signaling, which the authors also found to be high in SF3B1K700E mutant tumors, also promotes PDAC progression (https://www.biorxiv.org/content/10.1101/2022.06.29.497540v1). Thus, while I am thoroughly convinced by the rigorous ex vivo work that SF3B1K700E does lead to MAP3K7 inhibition -> EMT inhibition -> reduced TGF-β toxicity, further in vivo experiments would be needed to convince me that this mechanism is critical to tumor progression. For example, would forced expression of MAP3K7 slow orthotopic KP-SF3B1K700E tumor growth while leaving IFN-α signaling unperturbed?

      We thank the reviewer for raising these important points. To first test whether the upregulation of IFN-α signaling seen in our RNA-seq data of sorted KPC-Sf3b1K700E cells was directly caused by the Sf3b1-K700E mutation, we assessed the 5 most deregulated genes of the IFN-α signature in in-vitro activated KPC and KPC-Sf3b1K700E organoids (analogous to the experiments on the EMT gene signature in Figure 2-figure supplement 1D). However, in contrast to EMT marker genes, IFN-α signature genes were not differentially expressed in KPC-Sf3b1K700E vs. KPC organoids (Author response image 3). Thus, increased IFN-α signaling in KPC-Sf3b1K700E tumors in mice is likely an indirect consequence of more advanced cancers rather than an effect directly caused by Sf3b1K700E-mediated missplicing.

      Author response image 3.

      Author response image 3. Expression of the 5 most deregulated genes of the IFN-α gene set identified in sorted KPC-Sf3b1K700E cells in in-vitro activated KPC-Sf3b1K700E and KPC organoids. 4 biological replicates were performed. For analysis, Ct-values of the indicated genes were normalized to Actb and a two-tailed unpaired t-test was used to compute the indicated p-values.
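      Normalizing Ct values to a reference gene such as Actb is commonly done with the ΔCt (and 2^-ΔΔCt) approach. Assuming that is what was done here, a minimal sketch of the normalization and of the pooled-variance (Student's) two-sample t statistic on ΔCt values follows; the helper names and numbers are hypothetical. The two-tailed p-value would then come from the t distribution with n1 + n2 - 2 degrees of freedom (for example via `scipy.stats.ttest_ind`), which is not reimplemented here.

```python
import math
from statistics import mean, stdev
from typing import List, Sequence


def delta_ct(ct_gene: Sequence[float], ct_ref: Sequence[float]) -> List[float]:
    """Per-sample delta Ct: target gene Ct minus reference gene (e.g. Actb) Ct."""
    return [g - r for g, r in zip(ct_gene, ct_ref)]


def fold_change(dct_test: Sequence[float], dct_control: Sequence[float]) -> float:
    """2^-ddCt relative expression of the test group versus the control group."""
    return 2.0 ** -(mean(dct_test) - mean(dct_control))


def student_t(x: Sequence[float], y: Sequence[float]) -> float:
    """Two-sample t statistic with pooled variance (equal-variance assumption)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled_var * (1 / nx + 1 / ny))
```

      Note that a lower ΔCt means higher expression, so the sign of a t statistic computed on ΔCt values is inverted relative to expression.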

      To next examine the effect of Map3k7 on tumors in vivo, we established orthotopic transplantation models with KPC and KPC-Sf3b1K700E cells, with overexpression or knockdown of Map3k7 (Author response image 4). However, in contrast to the autochthonous mouse model, even orthotopically transplanted KPC and KPC-Sf3b1K700E cells without Map3k7 manipulation did not show differences in tumor size (see Figure 1-figure supplement 1M, N). These data support our hypothesis that Sf3b1-K700E plays an important role mainly during early stages of PDAC (KPC cells are isolated from fully developed PDAC tumors, and orthotopic KPC transplantation thus represents a late-stage PDAC model).

      Unfortunately, these data also demonstrate that orthotopic transplantation of KPC cells is not a suitable model for studying the impact of Map3k7 on PDAC development, and, as expected, neither Map3k7 overexpression in transplanted KPC-Sf3b1K700E cells nor shRNA-mediated knockdown of Map3k7 (shMap3k7) in transplanted KPC cells led to differences in growth compared to their control groups (Author response image 4). In line with these results, the EMT genes that were differentially expressed in our autochthonous mouse model (KPC vs. KPC-Sf3b1K700E) were expressed at similar levels upon Map3k7 downregulation or overexpression.

      Since establishing an autochthonous KPC PDAC mouse model with a knockdown of MAP3K7 is beyond the scope of a revision, in the revised manuscript we discuss the limitation that the molecular link between Sf3b1K700E, Map3k7 and TGF-β resistance has only been studied in vitro in organoids and cell lines. We also adapted the abstract and the title of the manuscript accordingly (formerly “Mutant SF3B1 promotes PDAC malignancy through TGF-β resistance”, now “Mutant SF3B1 promotes malignancy in PDAC”).

      Author response image 4.

      Author response image 4. (A) Relative gene expression of Map3k7 in KPC cells transduced with shRNA targeting Map3k7 (shMap3k7), normalized to KPC cells transduced with scrambled control shRNA (shCtrl). 3 biological replicates are shown. (B) Weight of tumors derived by orthotopic transplantation of shMap3k7 and shCtrl KPC cells. 5 biological replicates are shown. (C) Relative gene expression of EMT genes in tumors derived by orthotopic transplantation of shCtrl and shMap3k7 cells. 4 biological replicates are shown. (D) Relative gene expression of Map3k7 in KPC-Sf3b1K700E cells transduced with an overexpression vector of Map3k7 (OE Map3k7), normalized to control KPC cells without Map3k7 overexpression. 3 biological replicates are shown; a two-sided Student’s t-test was used to calculate significance. (E) Weight of tumors derived by orthotopic transplantation of Map3k7-overexpressing KPC-Sf3b1K700E cells (n=5) and control KPC-Sf3b1K700E cells (n=4). (F) Relative gene expression of EMT genes in tumors derived by orthotopic transplantation of KPC-Sf3b1K700E cells with and without overexpression of Map3k7. 4 biological replicates are shown. A two-sided Student’s t-test was used to calculate significance in Fig. 2A-F.

    1. Author Response:

      Reviewer #1:

      Zappia et al investigate the function of E2F transcriptional activity in the development of Drosophila, with the aim of understanding which targets the E2F/Dp transcription factors control to facilitate development. They follow up two of their previous papers (PMID 29233476, 26823289) that showed that the critical functions of Dp for viability during development reside in the muscle and the fat body. They use Dp mutants and tissue-targeted RNAi against Dp to deplete both activating and repressive E2F functions, focussing primarily on functions in larval muscle and fat body. They characterize changes in gene expression by proteomic profiling, bypassing the typical RNAseq experiments, and characterize Dp loss phenotypes in muscle, fat body, and the whole body. Their analysis revealed a consistent, striking effect on carbohydrate metabolism gene products. Using metabolite profiling, they found that these effects extended to carbohydrate metabolism itself. Considering that most of the literature on E2F/Dp targets is focused on the cell cycle, this paper conveys a new discovery of considerable interest. The analysis is very good, and the data provided support the authors' conclusions quite definitively. One interesting phenotype they show is low levels of glycolytic intermediates and circulating trehalose, which is traced to loss of Dp in the fat body. Strikingly, this phenotype and the resulting lethality during the pupal stage (metamorphosis) could be rescued by increasing dietary sugar. Overall the paper is quite interesting. Its main limitation, in my opinion, is a lack of mechanistic insight at the gene regulation level. This is due to the authors' choice to profile protein, rather than mRNA, effects, and their omission of any DNA binding (chromatin profiling) experiments that could define direct E2F1/Dp or E2F2/Dp targets.

      We appreciate the reviewer’s comment. Based on previously published chromatin profiling data for E2F/Dp and Rbf in thoracic muscles (Zappia et al 2019, Cell Reports 26, 702–719), we discovered that both Dp and Rbf are enriched upstream of the transcription start site of both cell cycle genes and metabolic genes (Figure 5 in Zappia et al 2019, Cell Reports 26, 702–719). Thus, our data are consistent with the idea that E2F/Rbf binds the canonical target genes in addition to a new set of target genes encoding proteins involved in carbohydrate metabolism. We think that E2F takes on a new role rather than being re-targeted away from cell cycle genes. We agree that further mechanistic insight would be worth exploring.

      Reviewer #2:

      The study sets out to determine which tissue-specific mechanisms regulated by the transcription factor E2F in fat and muscle are central to organismal function. The study also tries to address which of these roles of E2F are cell intrinsic and which are systemic. The authors look into the mechanisms of E2F/Dp through knockdown experiments in both the fat body* (see weakness) and muscle of Drosophila. They identify that muscle E2F contributes to fat body development but fat body KD of E2F does not affect muscle function. To then dissect the cause of adult lethality in flies, the authors used proteomic and metabolomic profiling of fat and muscle. While in the muscle the cause seems to be an as yet undetermined systemic change, the authors do conclude that adult lethality in fat body-specific Dp knockdown is the result of decreased trehalose in the hemolymph and defects in lipid production in these flies. The authors then test this model by feeding fat body-specific Dp knockdown flies a high-sugar diet and showing that adult survival is rescued. This study concurs with and adds to the emerging idea from human studies that E2F/Dp is critical for more than just its role in the cell cycle and functions as a metabolic regulator in a tissue-specific manner. This study will be of interest to scientists studying inter-organ communication between muscle and fat.

      The conclusions of this paper are partially supported by data. The weaknesses can be mitigated by specific experiments and will likely bolster conclusions.

      1) This study relies heavily on the tissue specificity of the Gal4 drivers to study fat-muscle communication by E2F. The authors have convincingly confirmed that the cg-Gal4 driver is never turned on in the muscle, and vice versa for Dmef2-Gal4. However, the cg-Gal4 driver is not only expressed in fat body cells but is also highly expressed in hemocytes (macrophage-like cells in flies). In fact, cg-Gal4 is used in numerous studies (e.g., https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4125153/) to study hemocytes and fat in combination. Hence, it is difficult to assess what contribution hemocytes make to the conclusions about fat-muscle communication. To mitigate this, the authors could test Lpp-Gal4>Dp-RNAi (Lpp-Gal4 drives expression exclusively in the fat body at all stages) or use ppl-Gal4 (which is expressed in the fat, gut, and brain but is a weaker driver than cg). It would be good if they could replicate their findings in a subset of the experiments performed in Figures 1-4.

      This is indeed an important point. We apologize for previously not including this information. Reference is now on page 7.

      Another fat body driver, which, unlike cg-GAL4, is expressed specifically in the fat body and not in hemocytes, was tested in previous work (Guarner et al Dev Cell 2017). The driver FB-GAL4 (FBti0013267), and more specifically the stock yw; P{w[+mW.hs]=GawB}FB P{w[+m*] UAS-GFP 1010T2}#2; P{w[+mC]=tubP-GAL80[ts]}2, was used to induce the loss of Dp in the fat body in a time-controlled manner using tubGAL80ts. The phenotype induced in the larval fat body of FB>DpRNAi,gal80TS recapitulates findings related to the DNA damage response characterized in both Dp-/- and cg>Dp-RNAi (see Figure 5A-B, Guarner et al Dev Cell 2017). The activation of the DNA damage response upon the loss of Dp was thoroughly studied in Guarner et al Dev Cell 2017. The appearance of binucleate cells in cg>DpRNAi is presumably the result of abnormal transcription of multiple G2/M regulators in cells that have been able to repair DNA damage and resume S-phase (see discussion in Guarner et al Dev Cell 2017). More details regarding the fully characterized DNA damage response phenotype were added on pages 6 and 7 of the manuscript.

      Additionally, r4-GAL4 was used to drive Dp-RNAi specifically in the fat body. However, since this driver is weaker than cg-GAL4, the occurrence of binucleated cells in r4>DpRNAi fat body was mild (see Figure R1 below).

      As suggested by the reviewer, Lpp-GAL4 was used to knock down the expression of Dp specifically in the fat body. All Lpp>DpRNAi animals died at the pupal stage. New viability data were included in Figure 1-figure supplement 1. Also, larval fat bodies were dissected and stained with phalloidin and DAPI to visualize overall tissue structure. Binucleated cells were present in Lpp>DpRNAi fat body but not in the control Lpp>mCherry-RNAi (Figure 2-figure supplement 1B). These results were added to the manuscript on page 7.

      Furthermore, Dp expression was knocked down using a hemocyte-specific driver, hml-GAL4. No defects were detected in animal viability (data not shown).

      Thus, overall, we conclude that hemocytes do not seem to contribute to the formation of binucleated-cells in cg>Dp-RNAi fat body.

      Finally, since no major phenotype was found in muscles when E2F was inactivated in the fat body (please see point 3 for more details), we consider that the inactivation of E2F in both fat body and hemocytes did not alter the overall muscle morphology. Thus, exploring the contribution of cg>Dp-RNAi hemocytes to muscle phenotypes would not be very informative.

      2) The authors perform a proteomics analysis on both fat body and muscle of control or the respective tissue specific knockdown of Dp. However, the authors denote technical limitations to procuring enough third instar larval muscle to perform proteomics and instead use thoracic muscles of the pharate pupa. While the technical limitations are understandable, this does raise a concern of comparing fat body and muscle proteomics at two distinct stages of fly development and likely contributes to differences seen in the proteomics data. This may impact the conclusions of this paper. It would be important to note this caveat of not being able to compare across these different developmental stage datasets.

      We appreciate the suggestion of the reviewer. This caveat was noted and included in the manuscript. Please see page 11.

      3) The authors show that E2F signaling in the muscle controls whether binucleate fat body nuclei appear; in other words, the endocycling process in the fat body is affected when muscle E2F function is impaired. However, they conclude that impairing E2F function in fat does not affect muscle. While muscle organization seems fine, nuclear levels of Dp appear to be higher in muscles during fat-specific knockdown of Dp (Figure 1A, column 2, row 3, for cg>Dp-RNAi). There is also an increase in muscle area when fat body E2F function is impaired, which is reflected in the quantification of DLM area in Figure 1B. However, the authors don't say much about the elevated Dp levels in muscle or the increased DLM area upon fat-specific Dp KD. Would the authors not expect Dp staining in cg>Dp-RNAi muscle to be normal and similar to the mCherry-RNAi control? The authors could consider discussing and contextualizing this, as opposed to making a broad statement that muscle function is entirely normal. Perhaps muscle function is different, or even better, when E2F function in fat is impaired.

      The overall muscle structure was examined in animals staged as third instar larvae (Figure 1A-B). No defects in muscle size were detected between cg>Dp-RNAi animals and controls. In addition, the expression of Dp was not altered in cg>Dp-RNAi muscles compared to control muscles. The best developmental stage at which to compare muscle structure between Mef2>Dp-RNAi and cg>Dp-RNAi animals is actually the third instar larval stage, prior to their lethality at the pupal stage (Figure 1- figure supplement 1).

      Based on the reviewer’s comment, we set up a new experiment to further analyze the phenotype at the pharate stage. However, when we repeated this experiment, we did not recover any cg>Dp-RNAi pharate adults, even though 2/3 of Mef2>Dp-RNAi animals survived up to the late pupal stage. We think that this is likely due to a change in fly food provider. Since most cg>DpRNAi animals die at the early pupal stage (>75% of animals, Figure 1-figure supplement 1), the pharate stage is not a representative developmental stage at which to examine phenotypes. Therefore, these panels were removed.

      Text was revised accordingly (page 6).

      4) In lines 376-380, the authors make the argument that muscle-specific knockdown can impair the ability of the fat body to regulate storage, but evidence for this is not robust. While the authors refer to a decrease in lipid droplet size in figure S4E this is not a statistically significant decrease. In order to make this case, the authors would want to consider performing a triglyceride (TAG) assay, which is routinely performed in flies.

      Our conclusions were revised and adjusted to match our data. The paragraph was reworded to highlight the outcome of the triglyceride assay, which was previously done. We realized the reference to Figure 6H that shows the triglyceride (TAG) assay was missing on page 17. Please see page 17 and page 21 of discussion.

    1. Author Response

      Reviewer #1 (Public Review):

      The manuscript by Hekselman et al presents analyses linking cell types to monogenic disorders using over-expression of monogenic disease genes as the signal. The manuscript analyses data from 6 tissues (bone marrow, lung, muscle, spleen, tongue and trachea) together with ~1,000 rare diseases from OMIM (with ~2,000 associated genes) to identify cell types of interest for a specific disease of choice. The signal used by the approach is the expression of OMIM genes in a particular cell type relative to the expression of the gene in the tissue of interest, identifying cell type–disease pairs that are then investigated through literature review and recapitulated using mouse expression. A potentially interesting finding is that disease genes manifesting in multiple tissues seem to hit the same cell types. Overall, this important study combines multiple data analyses to quantify the connection between cell types and human disorders. However, whereas some of the analyses are compelling, the statistical analyses are incomplete, as they don't provide a full treatment of type I error.

      Statistical analyses were changed to include permutation testing and a different threshold (Results, page 6, 1st paragraph; Methods, page 21-22, ‘PrEDiCT score calculation and significance assessment’; Figure 1–figure supplement 2). Assessments of type I error were based on literature text-mining and expert curation, and showed that false-positive rates were low in both (0.01 and 0.07, respectively; Figure 1F and Figure 1–figure supplement 4A).

      Reviewer #2 (Public Review):

      This study identifies 110 disease-affected cell types for 714 Mendelian diseases, based on preferential expression of known disease-associated genes in single-cell data. It is likely that many or most of the results are real, and the results are biologically interesting and provide a valuable resource. However, updates to the method are needed to ensure that inference of statistical significance is appropriately stringent and rigorous.

      Strengths: a systematic evaluation of disease-affected cell types across Mendelian diseases is a valuable addition to the literature, complementing systematic evaluations of common disease and targeted analyses of individual Mendelian diseases. The validation via excess overlap with disease–cell type pairs from literature co-appearance provides compelling evidence that many or most of the results are real. In addition, many of the results are biologically interesting. In particular, it is interesting that diseases with multiple affected tissues tend to affect similar cell types in the respective tissues.

      Limitations: the main limitation of the study is that, although many or most of the results are likely to be real, the criteria for statistical significance are probably not stringent enough and are not well-justified. For diseases with only 1 disease-associated gene, the threshold is a z-score>2 for preferential expression in the cell type, but this threshold is likely to be often exceeded by chance. (For diseases with many disease-associated genes, the threshold is a median (across genes) z-score>2 for preferential expression in the cell type, which is less likely to occur by chance but still an arbitrary threshold.) Thus, there is a good chance that a sizable proportion of the reported disease-affected cell types might be false positives. The best solution would be to assess statistical significance via empirical comparison with results for non-disease-associated control genes, and assess the statistical significance of the resulting P-values using FDR.

      We thank the reviewer for the valuable insights and suggestions. We revised the method to assess statistical significance by using empirical comparison followed by FDR correction, as suggested by the reviewer (Results, page 6, 1st paragraph; Methods, page 21-22, ‘PrEDiCT score calculation and significance assessment’; Figure 1–figure supplement 2).

      The re-analysis using mouse single-cell data adds an interesting additional dimension to the study, with the small caveat that mouse single-cell data does not provide statistically independent information across genes (for the same reason that adding data from independent human individuals would not provide statistically independent information across genes, given that human and mouse expression are partially correlated).

      We acknowledge this caveat in the text (Discussion, page 17, 2nd paragraph, lines 8-11).

      Reviewer #3 (Public Review):

      The authors describe the method, PrEDiCT, which helps identify disease affected cell types based on gene sets. As I understand it, the method is based on finding which "disease genes" (from an annotation) are relatively highly expressed. The idea is nice, however, I have concerns about how "significance" is assessed and the relative controls.

      Overall, I find the idea interesting, but the execution raises some concerns.

      1) From a causal perspective, there is an association of high expression of these genes within these cell types, but without also assessing individuals with those specific diseases, I do not think it is fair to call these "disease affected" cell types. It is possible that these genes function normally in the cell types where they are highly expressed, while being affected in other cell types.

      We agree with the reviewer. We changed the terminology to "likely disease-affected cell types” and added this caveat to the Discussion, page 16, 2nd paragraph.

      2) It is unclear to me what the "null" comparison is in the method and if there is one. For example, by chance, would I expect this gene to be highly expressed because other genes are also highly expressed in this cell type? Some way to assess "significance" or "enrichment" beyond simply using ranks and thresholds would be helpful in deciding whether these associations are robust.

      We revised the procedure for assessing statistical significance to include permutation tests. Specifically, given a disease D with n disease-associated genes, the null hypothesis was that the PrEDiCT score of these genes does not differ from the PrEDiCT score of a random set of n genes. To test this, we randomly selected n genes expressed in any cell type, and computed the PrEDiCT score for this random gene set in each cell type of the disease-affected tissue (referred to as ‘random score’). We repeated this procedure 1,000 times, resulting in 1,000 random scores per disease and cell type. The p-value of the PrEDiCT score of disease D in cell type c was set to the fraction of random scores in c that were at least as high as the original PrEDiCT score of D in c. The acquired p-values were adjusted for multiple hypothesis testing per disease using the Benjamini-Hochberg procedure. To increase stringency, we treated only statistically significant disease–cell-type pairs with PrEDiCT score≥1 as 'likely affected'. The procedure is detailed in Results, page 6, 1st paragraph; Methods, page 21-22, ‘PrEDiCT score calculation and significance assessment’; Figure 1–figure supplement 2. Additionally, we estimated type I error by using literature text-mining or expert curation (Results, page 7, 2nd paragraph; Methods, page 22, ‘Text-mining of PubMed records’, and page 23, ‘Expert curation and assessment of disease-affected cell types’; Figure 1F and Figure 1–figure supplement 4A).
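      The permutation procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `pref_expr` is a hypothetical genes × cell-types matrix of precomputed preferential-expression z-scores restricted to expressed genes, all function names are ours, and the FDR cutoff of 0.05 is an assumed value because the reply does not state it.

```python
import numpy as np


def predict_score(pref_expr, gene_idx):
    """Median preferential expression of the given genes in every cell type.

    pref_expr: hypothetical array (n_genes, n_cell_types) of precomputed
    preferential-expression z-scores for one tissue.
    """
    return np.median(pref_expr[np.asarray(gene_idx), :], axis=0)


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Step-up: each adjusted value is the minimum over all larger ranks.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted


def permutation_test(pref_expr, disease_genes, n_perm=1000, seed=None):
    """Compare a disease's score in each cell type with scores of random gene sets."""
    rng = np.random.default_rng(seed)
    n_genes, n_cell_types = pref_expr.shape
    n = len(disease_genes)
    observed = predict_score(pref_expr, disease_genes)
    random_scores = np.empty((n_perm, n_cell_types))
    for i in range(n_perm):
        random_genes = rng.choice(n_genes, size=n, replace=False)
        random_scores[i] = predict_score(pref_expr, random_genes)
    # p-value: fraction of random scores at least as high as the observed score
    pvals = (random_scores >= observed).mean(axis=0)
    qvals = benjamini_hochberg(pvals)  # adjusted per disease, across cell types
    return observed, pvals, qvals


def likely_affected(observed, qvals, fdr=0.05, min_score=1.0):
    """Cell types that pass the (assumed) FDR cutoff and have score >= 1."""
    return (qvals < fdr) & (observed >= min_score)
```

As in the reply, the p-value is the plain fraction of random scores at least as high as the observed one, so it can be exactly zero; some analyses prefer the (k + 1)/(n_perm + 1) correction instead.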

      3) Additionally, it is unclear to me, but I suspect that there are unequal cell numbers in the scores computed as well as between relevant tissues. This is related to point (2) above, but as a result, the estimates of the scores will inherently have different variances, thus making comparisons between them difficult/unreliable unless accounted for. If I understand correctly, the score is first the average expression within a tissue, then, the Z-score? If so, my comment applies.

      To clarify, the PrEDiCT score of a disease D in cell type c was set to the median preferential expression P of its disease genes (Equation 1 below). The preferential expression of each gene in c was computed as a Z-score, by comparing the average expression of the gene in c to its average expression in all cell types of the tissue, divided by the standard deviation (SD, Equation 2 below). Tissues indeed had unequal numbers of cell types; however, the distributions of PrEDiCT scores were similar across tissues (now in Supplementary File 13). We revised this part of Methods and added Equations 1 and 2 (Methods, page 21-22, ‘PrEDiCT score calculation and significance assessment’) and Supplementary File 13.
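      Since Equations 1 and 2 are not reproduced in this reply, the following is only a sketch of the score as described in words: P(g, c) is the average expression of gene g in cell type c minus its mean across the tissue's cell types, divided by the SD, and the score of disease D in c is the median of P over D's genes. Taking the mean and SD over per-cell-type averages (rather than over individual cells) is our assumption, and the function names are hypothetical.

```python
import numpy as np


def preferential_expression(celltype_means):
    """Z-score of each gene's average expression in each cell type.

    celltype_means: array (n_genes, n_cell_types) of the average expression of
    each gene within each cell type of one tissue. Assumption: the reference
    mean and SD are taken over these per-cell-type averages.
    """
    mu = celltype_means.mean(axis=1, keepdims=True)
    sd = celltype_means.std(axis=1, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        z = (celltype_means - mu) / sd
    # Genes with identical expression in all cell types carry no preference.
    return np.where(sd > 0, z, 0.0)


def predict_score(celltype_means, disease_genes):
    """Score per cell type: median preferential expression over the disease genes."""
    pref = preferential_expression(celltype_means)
    return np.median(pref[np.asarray(disease_genes), :], axis=0)
```

Because the score is a median, a single strongly cell-type-specific gene cannot dominate the score of a disease with many associated genes.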

      4) There is a large body of work on gene set enrichment that appears not to be mentioned (e.g., GSEA and other work by the Price group). It would be helpful for the authors to summarize these methods and how their method differs.

      We added work on gene set enrichment (including two relevant and recent studies from the Price group) and summarized these methods in the Introduction (pages 2-3).

      5) Additionally, it should be noted that a caveat of this analysis is that the comparisons are all done only relative to the cell types sampled and the diseases which have Mendelian genes associated with them. I would expect these results to change, possibly drastically, if the sampled cell types and diseases were to be changed.

      We agree with the reviewer and now discuss the generalizability of our results, relating to the extent of the sampled cell types (Discussion, page 18, 1st paragraph).

      6) Finally, I would appreciate a more detailed explanation in the methods of how the score is computed. Some equations and the data they are calculated from would be helpful here.

      We now provide a detailed explanation of how the score and its statistical significance were computed and added Equations 1 and 2 (Methods, page 21-22, ‘PrEDiCT score calculation and significance assessment’).

      In summary, the general idea is an interesting one, but I do think the issues above should be addressed to make the results convincing.

      We thank the reviewer for the important feedback which helped us strengthen our analyses.

    1. Author Response:

      Reviewer #1:

      Chen et al. trained male and female animals on an explore/exploit (2-armed bandit) task. Despite similar levels of accuracy in these animals, the authors report higher levels of exploration in males than in females. The patterns of exploration were analyzed in fine-grained detail: males are less likely to stop exploring once exploring is initiated, whereas female mice stop exploring once they learn. The authors find that both learning rate (alpha) and noise parameter (beta) increase in exploration trials identified by a hidden Markov model (HMM). When reinforcement learning (RL) models were fitted to animal data, they report that females had a higher learning rate, which also increased over days of testing, suggesting greater meta-learning in females. They also report that, of the RL models they fit, the model incorporating a choice kernel updating rule was found to fit both male and female learning. The results do suggest one should pay greater attention to the influence of sex in learning and exploration. Another important takeaway from this study is that similar levels of accuracy do not imply similar strategies. Essential revisions include a request to show more primary behavioral data, to provide a rationale for the different RL models and their parameters, to clarify the difference between learning and 'steady state,' and to qualify how these experiments uniquely identify latent cognitive variables not previously explored with similar methods.
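      For readers unfamiliar with the choice-kernel model mentioned above, one common formulation combines a value term, learned from rewards, with a choice-kernel term that tracks which options were chosen recently, regardless of outcome. The sketch below computes the log-likelihood of a choice sequence under that formulation. The update rules in the comments, the initial values, and the parameter names (`alpha_c`, `beta_c`) are assumptions for illustration and not necessarily the exact model fitted by Chen et al.

```python
import numpy as np


def choice_kernel_loglik(choices, rewards, alpha, beta, alpha_c, beta_c,
                         n_options=2, q0=0.5):
    """Log-likelihood of a choice sequence under an RL + choice-kernel model.

    choices: option indices (0..n_options-1); rewards: 0/1 outcomes.
    alpha, beta: value learning rate and inverse temperature.
    alpha_c, beta_c: choice-kernel learning rate and weight (hypothetical names).
    """
    q = np.full(n_options, q0, dtype=float)  # assumed initial values
    ck = np.zeros(n_options)                 # choice kernel starts neutral
    loglik = 0.0
    for a, r in zip(choices, rewards):
        logits = beta * q + beta_c * ck
        logits -= logits.max()               # numerical stability
        probs = np.exp(logits) / np.exp(logits).sum()
        loglik += np.log(probs[a])
        q[a] += alpha * (r - q[a])           # update only the chosen option's value
        chosen = np.zeros(n_options)
        chosen[a] = 1.0
        ck += alpha_c * (chosen - ck)        # every kernel moves toward recent choices
    return loglik
```

With `beta_c = 0` the model reduces to a standard softmax Q-learner; a positive `beta_c` captures a tendency to repeat recent choices independent of reward.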

      We appreciate the reviewer’s thorough reading of the paper and hope that the changes we detail below will address these concerns.

      Reviewer #2:

      The authors investigated sex differences in the explore-exploit tradeoff using a drifting binary bandit task in rodents. The authors tried to claim that males and females use different means to achieve similar levels of accuracy in making explore-exploit decisions. In particular, they argue that females explore less but learn more quickly during exploration. The topic is very interesting, but I am not yet convinced by the conclusions.

      Here are my major points:

      1) This paper showed that males explore more than females, and through computational modeling, they showed that females have a higher learning rate compared to males. The fact that males explore more and have lower learning rates compared to females could be an interesting finding, as the paper tries to claim, but it could also be that female rats simply learn the task better than male rats in the task used.

      We have revised the manuscript to better demonstrate that male mice did not acquire fewer rewards than females, and included all analyses and plots requested in this review. Ultimately, there was no evidence that they learned the task any less well than the females did. We appreciated this comment because it has strengthened the evidence we were able to present that males and females take different paths to the same outcome. Completing these analyses has also allowed us to clarify the relationship between RL learning rates and performance in this classic dynamic decision-making task.

      (a) First, from Figure 1B, it looks like p(reward, chance) is similar between sexes, but visually, the female rats' performance, p(reward, obtained), looks slightly better than the males'. It would be nice if the authors could show a bar plot comparison like in Figure 1C and 1E. A non-significant test here only fails to show sex differences in performance; it cannot be concluded that there are no sex differences in performance. Further evidence needs to be reported here to help readers see whether there are qualitative differences in performance at all.

      The requested bar plot has been added as Figure 1C and illustrates our central point: male mice did not acquire fewer rewards than females, so there is no evidence that they learned the task any less well than the females did. The t-test we originally reported found no significant difference between the mean percent reward obtained by males and females, but we take the reviewer’s point that the male and female distributions may differ in other, more subtle ways. Therefore, we conducted an additional statistical test. The Kolmogorov-Smirnov (KS) test takes into account not only the means of the distributions but also their shapes. The null hypothesis is that both groups were sampled from populations with identical distributions, and the test is sensitive to any violation of that null hypothesis: different medians, different variances, or different distribution shapes. The KS test found no significant difference between the male and female distributions of reward acquisition (Kolmogorov-Smirnov D = 0.1875, p = 0.94), so there is no evidence that males and females differ in either the mean or the distribution of performance.
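      The two-sample KS statistic D is the largest vertical gap between the two groups' empirical cumulative distribution functions. The sketch below computes D with NumPy and, to stay self-contained, estimates a p-value by label permutation; this permutation p-value is a swapped-in approximation, not the exact p-value that a routine such as SciPy's `scipy.stats.ks_2samp` reports. Function names and defaults are hypothetical. As with any such test, a large p-value indicates no detectable difference rather than proof that the distributions are identical.

```python
import numpy as np


def ks_statistic(x, y):
    """Two-sample KS statistic: max |ECDF_x - ECDF_y| over all observed values."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    grid = np.concatenate([x, y])
    cdf_x = np.searchsorted(x, grid, side="right") / x.size
    cdf_y = np.searchsorted(y, grid, side="right") / y.size
    return np.max(np.abs(cdf_x - cdf_y))


def ks_permutation_pvalue(x, y, n_perm=5000, seed=None):
    """Permutation p-value for D: shuffle group labels and recompute D."""
    rng = np.random.default_rng(seed)
    observed = ks_statistic(x, y)
    pooled = np.concatenate([np.asarray(x, float), np.asarray(y, float)])
    n_x = len(x)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        if ks_statistic(perm[:n_x], perm[n_x:]) >= observed - 1e-12:
            count += 1
    return observed, (count + 1) / (n_perm + 1)
```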

      New text from the manuscript (page 5, line 119-128):

      “There was no significant sex difference in the probability of rewards acquired above chance (Figure 1C, main effect of sex, F(1, 30) = 0.05, p = 0.83). While the mean percent reward obtained did not differ across sexes, we considered the possibility that the distributions of reward acquisition in males and females might differ. We conducted the Kolmogorov-Smirnov (KS) test, which takes into account not only the means of the distributions but also their shapes. The KS test found no significant difference between the male and female distributions of reward acquisition (Kolmogorov-Smirnov D = 0.1875, p = 0.94). This result is consistent with equivalently strong understanding and performance of the task in males and females.”

      (b) The exploration and exploitation states are defined by fitting a hidden Markov model. In the exploration phase, the agent chooses left and right randomly. From Figure 1E and 1F, it looks like for male rats, they choose completely randomly 70% of the times (around 50% for females). The exploration state here is confounded with the state of pure guessing (poor performance).

      This comment seems to confuse our descriptive HMM with a generative model. The HMM does not imply that choices are being made randomly. Instead, exploratory choices are modeled as a uniform distribution over choices. This was done only because this is the maximum entropy distribution for a categorical variable: the distribution that makes the fewest assumptions about the true underlying distribution and thus does not bias the model towards or away from any particular pattern of choices during exploration. For example, Ebitz et al. (2019) have shown that the HMM can recover periods of exploration that are highly structured and information-maximizing, despite being modeled in exactly this way.

      Because the model does not imply or require that exploratory choices are random, we could, in the future, ask whether these choices reflect random exploration or instead more directed forms of exploration. However, for various reasons, this task is not the ideal testbed for isolating random and directed exploration, though this is a direction we hope to go in the future. To clarify our model and address these issues for future research, we have added the following text (page 31, line 745-756):

      “The emissions model for the explore state was uniform across the options, assigning equal probability (1/2) to each of the two choices.

      This is simply the maximum entropy distribution for a categorical variable - the distribution that makes the fewest number of assumptions about the true distribution and thus does not bias the model towards or away from any particular type of high-entropy choice period. This doesn’t require, imply, impose, or exclude that decision-making happening under exploration is random. Ebitz et al. 2019 have shown that exploration was highly structured and information-maximizing, despite being modeled as a uniform distribution over choices (Ebitz et al., 2020, 2019). Because exploitation involves repeated sampling of each option, exploit states only permitted choice emissions that matched one option.”
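      To make the emissions model concrete, the sketch below encodes the structure described in the quoted text for the two-option case (one explore state with uniform emissions, plus one exploit state per option that emits only that option) and computes per-trial state posteriors with the standard forward-backward algorithm. The transition matrix and initial distribution passed in are hypothetical placeholders; in practice they would be fitted to each animal's choices. Names are ours.

```python
import numpy as np

# States: 0 = explore, 1 = exploit left, 2 = exploit right.
# Rows: p(choice | state); columns: choice 0 (left), choice 1 (right).
EMISSIONS = np.array([
    [0.5, 0.5],  # explore: uniform, the maximum-entropy categorical distribution
    [1.0, 0.0],  # exploit left: only left choices are permitted
    [0.0, 1.0],  # exploit right: only right choices are permitted
])


def entropy_bits(p):
    """Shannon entropy (bits) of a categorical distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())


def state_posteriors(choices, transition, initial):
    """Forward-backward posteriors p(state_t | all choices), rescaled each step.

    transition[i, j] = p(state_{t+1} = j | state_t = i).
    """
    n, k = len(choices), transition.shape[0]
    fwd = np.zeros((n, k))
    bwd = np.ones((n, k))
    fwd[0] = initial * EMISSIONS[:, choices[0]]
    fwd[0] /= fwd[0].sum()
    for t in range(1, n):
        fwd[t] = (fwd[t - 1] @ transition) * EMISSIONS[:, choices[t]]
        fwd[t] /= fwd[t].sum()
    for t in range(n - 2, -1, -1):
        bwd[t] = transition @ (EMISSIONS[:, choices[t + 1]] * bwd[t + 1])
        bwd[t] /= bwd[t].sum()
    post = fwd * bwd
    return post / post.sum(axis=1, keepdims=True)
```

Nothing in the uniform emission row forces explore-labelled choices to look random: a run of alternating choices, for instance, is simply more probable under the explore state than under either exploit state.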

      (c) Figure 2 basically says that you can choose randomly for two reasons, to be more "noisy" in your decisions (have a higher temperature term), or to ignore the values more (by having a learning rate of 0, you are just guessing). It would be nice to show a simulation of p(reward, obtained) by learning rate x inverse temperature (like in Figure 2C). From Figure 2B, it looks like higher learning rates means better value learning in this task. It seems to me that it's more likely the male rats simply learn the task more poorly and behave more randomly which show up as more exploration in the HMM model.

      This is an important comment, and addressing it gave us a chance to show the complicated, nonlinear relationship between the learning rate term and performance in this task. Per the reviewer’s request, we now include a plot showing how learning rate (α) and inverse temperature (β) affect reward acquisition (Figure 3F). However, this figure demonstrates that a higher learning rate does not mean better performance in this task. Performing well in this task requires both the ability to learn new information and the ability to hang onto the information that has already been learned. That can only happen when learning rates are moderate, not maximal. When the learning rate is maximal, behavior is reduced to a win-stay lose-shift policy, where only the outcome of the previous trial is taken into account for choice. This actually results in a lower percentage of reward obtained. We have addressed the difference between the learning rate parameter in the reinforcement learning (RL) model and actual learning performance in the comment above. We believe that this new figure illustrates an essential point: different strategies can result in the same learning performance.

      This result shows that the male strategy was a valid one that doesn’t perform worse than the female strategy. Not only did they have identical performance (Figure 1C), but their optimized RL parameters put them both within the same predicted performance gradient in this new plot (Figure 3F). That’s exactly why we believe that it is important to understand differences in how individuals approach the same task, even as they may achieve the same overall levels of performance.

      New text from the manuscript (page 14, line 368-385):

      “While females had a significantly higher learning rate (α) than males, they did not obtain more rewards than males. This is because the learning rate parameter in an RL model does not equate to learning performance, which is better measured by the number of rewards obtained. The learning rate parameter reflects the rate of value updating from past outcomes. Performing well in this task requires both the ability to learn new information and the ability to hold onto previously learned information, which occurs when the learning rate is moderate but not maximal. When the learning rate is maximal (α = 1), only the outcome of the immediately preceding trial is taken into account for the current choice. This essentially reduces the strategy to win-stay lose-shift, where choice is fully dependent on the previous outcome. Thus, a higher learning rate in an RL model does not necessarily translate to better reward acquisition performance. To illustrate that different combinations of learning rate and decision noise can result in the same reward acquisition performance, we conducted computer simulations of 10,000 RL agents defined by different combinations of learning rate (α) and inverse temperature (β) and plotted their reward acquisition performance in the restless bandit task (Figure 3F). This figure demonstrates that 1) different combinations of learning rate and inverse temperature can result in similar performance, and 2) optimal reward acquisition is achieved when the learning rate is moderate. This result suggests that not only did males and females have identical performance, but their optimized RL parameters also placed them both within the same predicted performance gradient in this plot.”
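      A simulation of this kind can be approximated with a simple softmax Q-learning agent swept over a grid of learning rates and inverse temperatures. In the sketch below, the restless dynamics (each arm's reward probability stepping by ±0.1 with probability 0.1 per trial, clipped to [0, 1]), the number of trials, the initial reward probabilities, and the initial values are illustrative assumptions rather than the task's actual schedule; function names are hypothetical.

```python
import numpy as np


def simulate_agent(alpha, beta, rng, n_trials=300, hazard=0.1, step=0.1):
    """Fraction of rewarded trials for one softmax Q-learning agent.

    Assumed restless dynamics (illustrative, not the task's exact schedule):
    on each trial, each arm's reward probability moves by +/- step with
    probability `hazard`, clipped to [0, 1].
    """
    p_reward = rng.uniform(0.1, 0.9, size=2)
    q = np.full(2, 0.5)
    n_rewards = 0.0
    for _ in range(n_trials):
        # Two-option softmax: p(choose 1) = 1 / (1 + exp(beta * (q0 - q1)))
        x = np.clip(beta * (q[0] - q[1]), -500, 500)
        a = int(rng.random() < 1.0 / (1.0 + np.exp(x)))
        r = float(rng.random() < p_reward[a])
        n_rewards += r
        # With alpha = 1, q[a] is overwritten by that option's latest outcome.
        q[a] += alpha * (r - q[a])
        moves = (rng.random(2) < hazard) * rng.choice([-step, step], size=2)
        p_reward = np.clip(p_reward + moves, 0.0, 1.0)
    return n_rewards / n_trials


def performance_grid(alphas, betas, n_agents=20, seed=0):
    """Mean reward rate for each (alpha, beta) combination."""
    rng = np.random.default_rng(seed)
    grid = np.zeros((len(alphas), len(betas)))
    for i, alpha in enumerate(alphas):
        for j, beta in enumerate(betas):
            grid[i, j] = np.mean([simulate_agent(alpha, beta, rng)
                                  for _ in range(n_agents)])
    return grid
```

For example, `performance_grid(np.linspace(0, 1, 11), np.linspace(0, 20, 11))` returns an 11 × 11 array of mean reward rates that can be drawn as a heat map; under these assumptions, similar reward rates can arise from quite different (α, β) pairs.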

      (d) From figure 3E, it looks like female rats learn better across days but male rats do not, but I am not sure. If you plot p(reward, obtained) vs times(days), do you see an improvement in female rats as opposed to males? Figure 4 also showed that females show more win-stay-lose-shift behavior and use past information more, both are indicators of better learning in this task.

      Taking the above together, I am not convinced about the strategic sex differences in exploration; it looks more like the female rats simply learn better in this task.

      There was no change in performance across days in either males or females. At the reviewer’s request, we now include a new plot illustrating p(reward, obtained) over days in Supplemental Figure 1. Ultimately, this is consistent with the points we clarified above and demonstrated in this figure: males and females had identical performance in this task.

      To the other points raised here, about sex differences in win-stay lose-shift and mutual information: these are the strategic differences at the heart of the paper, but, again, they did not alter overall performance for the reasons detailed above. Figure 4 did show that females were doing more win-stay. However, after further examining win-stay behavior by explore-exploit state, we found that females were only doing more win-stay during exploratory trials (Figure 5E). There was no difference in win-stay during exploitative trials. Figure 5F also demonstrated that females did more win-stay lose-shift in the exploration state, indicating that females only learned better during exploration. Although males learned more slowly during exploration, they compensated for that by exploring for longer. Both male and female strategies are equally effective and may be differentially advantageous in different tasks.
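      The win-stay and lose-shift probabilities discussed above, split by HMM state, can be computed along these lines. This is a sketch, not the authors' analysis code: the convention that a transition from trial t-1 to trial t is assigned to the state of trial t is an assumption, as is the `mask` argument (for example, True on trials the HMM labels as exploratory).

```python
import numpy as np


def win_stay_lose_shift(choices, rewards, mask=None):
    """P(stay | previous win) and P(shift | previous loss).

    mask (optional, hypothetical): boolean per trial; a transition from
    trial t-1 to trial t is counted only if mask[t] is True.
    """
    choices = np.asarray(choices)
    rewards = np.asarray(rewards, dtype=bool)
    stay = choices[1:] == choices[:-1]
    prev_win = rewards[:-1]
    if mask is None:
        keep = np.ones(stay.size, dtype=bool)
    else:
        keep = np.asarray(mask, dtype=bool)[1:]
    wins, losses = prev_win & keep, ~prev_win & keep
    p_win_stay = stay[wins].mean() if wins.any() else np.nan
    p_lose_shift = (~stay[losses]).mean() if losses.any() else np.nan
    return float(p_win_stay), float(p_lose_shift)
```

Calling it once with an explore mask and once with its complement gives the state-specific comparison described above.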

      Finally, to address meta-learning: in developing our response to this comment and looking for any other signs of adaptation across days (sex-dependent or not), we revisited these results and decided to rewrite some passages to be more circumscribed in our interpretations. Figure 3E showed increased learning rate parameters across days in females. We were initially excited about this idea of meta-learning; however, we found no other evidence of adaptation over time in multiple behavioral measures, including reward acquisition, response time, and retrieval time (Supplemental Figure 1). Changes in learning rate parameters over sessions from the RL model were marginally significant, and we feel they are worth mentioning for completeness, but they were only a small contributor to the overall sex differences in the behavioral profile. As a result, we have toned down the conclusion we drew from this result.

      New text from the manuscript (page 4, lines 93-113):

      “It is worth noting that, unlike other versions of bandit tasks such as the reversal learning task, in the restless bandit task animals were encouraged to continuously learn about the most rewarding choice(s). There is no asymptotic performance during the task because the reward probability of each choice constantly changes. Performance is best measured by the amount of reward obtained. Prior to data collection, both male and female mice had learned to perform this task in the touchscreen operant chamber. To examine whether mice had learned the task, we first calculated the average probability of reward acquisition across sessions in males and females (Supplemental Figure 1A). There were no significant changes in reward acquisition performance across sessions in either sex, demonstrating that both males and females had learned to perform the task and had reached an asymptotic level of performance across sessions (two-way repeated measures ANOVA, main effect of session, p = 0.71). We then examined two other primary behavioral metrics associated with learning across sessions: response time and reward retrieval time (Supplemental Figure 1B, C). Response time was calculated as the time elapsed between display onset and completion of the nose-poke response. Reward retrieval time was measured as the time elapsed between the nose-poke response and magazine entry for reward collection. There was no significant change in response time (two-way repeated measures ANOVA, main effect of session, p = 0.39) or reward retrieval time (main effect of session, p = 0.71) across sessions in either sex, which again demonstrated that both sexes had learned how to perform the task. Since both sexes had learned to perform the task prior to data collection, variability in task performance reflects how animals learned and adapted their choices in response to the changing reward contingencies.”
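      The session-level metrics defined in this passage can be computed from trial timestamps along these lines. The `Trial` record and its field names are hypothetical; response time is display onset to nose-poke completion, and retrieval time is nose-poke to magazine entry, computed only on rewarded trials with a logged entry.

```python
from dataclasses import dataclass
from typing import Optional

# Sketch only: the Trial record and its field names are hypothetical.


@dataclass
class Trial:
    """One trial's event times in seconds."""
    session: int
    display_onset: float
    poke_time: float                        # nose-poke response completed
    rewarded: bool
    magazine_entry: Optional[float] = None  # reward collection, if logged


def session_metrics(trials):
    """Per-session reward probability, mean response time and mean retrieval time."""
    by_session = {}
    for tr in trials:
        by_session.setdefault(tr.session, []).append(tr)
    metrics = {}
    for session, ts in sorted(by_session.items()):
        rts = [t.poke_time - t.display_onset for t in ts]
        retrieval = [t.magazine_entry - t.poke_time
                     for t in ts if t.rewarded and t.magazine_entry is not None]
        metrics[session] = {
            "p_reward": sum(t.rewarded for t in ts) / len(ts),
            "response_time": sum(rts) / len(rts),
            "retrieval_time": sum(retrieval) / len(retrieval) if retrieval else float("nan"),
        }
    return metrics
```

      The resulting per-session values could then be entered into a repeated-measures ANOVA, for example with `statsmodels.stats.anova.AnovaRM`; that step is not shown here.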

      Page 14, lines 386-390:

      “One interesting finding is that, when we compared learning rates across sessions within each sex, females, but not males, showed an increased learning rate with task experience (Figure 3G, repeated measures ANOVA, female: main effect of time, F(2.26, 33.97) = 5.27, p = 0.008; male: main effect of time, F(2.5, 37.52) = 0.23, p = 0.84). This points to potential sex differences in meta-learning that could contribute to the differential strategies across sexes.”
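      Session-wise learning-rate estimates like those compared above are typically obtained by maximum-likelihood fitting. The sketch below assumes a two-parameter delta-rule/softmax model as a simplification (the manuscript's RL models may include additional parameters) and uses a coarse grid search; fitting each session separately would give the per-session α values that a repeated-measures ANOVA compares.

```python
import math

# Sketch only: a simplified two-parameter model, not necessarily the authors' RL model.


def neg_log_likelihood(alpha, beta, choices, rewards):
    """Negative log-likelihood of 0/1 choices under a delta-rule + softmax model."""
    q = [0.5, 0.5]
    nll = 0.0
    for c, r in zip(choices, rewards):
        p_right = 1.0 / (1.0 + math.exp(-beta * (q[1] - q[0])))
        p_choice = p_right if c == 1 else 1.0 - p_right
        nll -= math.log(max(p_choice, 1e-12))  # floor avoids log(0)
        q[c] += alpha * (r - q[c])              # update only the chosen arm
    return nll


def fit_grid(choices, rewards, alphas=None, betas=None):
    """Coarse maximum-likelihood fit of (alpha, beta) by exhaustive grid search."""
    alphas = alphas or [i / 20 for i in range(1, 21)]  # 0.05 ... 1.0
    betas = betas or [0.5 * i for i in range(1, 41)]   # 0.5 ... 20.0
    nll, alpha, beta = min(
        (neg_log_likelihood(a, b, choices, rewards), a, b)
        for a in alphas for b in betas
    )
    return {"alpha": alpha, "beta": beta, "nll": nll}
```

      A grid keeps the example transparent; a bounded optimiser such as `scipy.optimize.minimize` could refine the grid optimum.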

      2) I do like how the authors define exploration vs. exploitation states via an HMM using choices alone. It would be interesting to see how the sex differences in reaction time are modulated by exploration vs. exploitation state. As the authors showed, RT in the exploration state is longer. Hence, it would make a conceptual difference whether the sex difference in reaction times is due to different proportions of time spent in exploration vs. exploitation across sexes.
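      The idea of labelling explore and exploit states from choices alone can be illustrated with Viterbi decoding of a small HMM. The structure below (one explore state emitting left and right with equal probability, two exploit states that almost always repeat one side, and no direct switch between exploit states) and all parameter values are assumptions for illustration; in practice the transition parameters would be fitted, for example by expectation-maximisation, rather than fixed.

```python
import math

# Sketch only: the HMM structure and parameter values below are assumptions.
STATES = ("explore", "exploit_left", "exploit_right")


def build_hmm(p_stay_explore=0.7, p_stay_exploit=0.9, eps=1e-3):
    """Return (start, transition, emission) probabilities for a 3-state HMM."""
    leave = (1.0 - p_stay_explore) / 2.0
    trans = [
        [p_stay_explore, leave, leave],               # explore -> any state
        [1.0 - p_stay_exploit, p_stay_exploit, 0.0],  # exploit-left: stay or explore
        [1.0 - p_stay_exploit, 0.0, p_stay_exploit],  # exploit-right: stay or explore
    ]
    emit = [                                          # P(choice 0 = left, choice 1 = right)
        [0.5, 0.5],                                   # explore: unbiased
        [1.0 - eps, eps],                             # exploit-left: almost always left
        [eps, 1.0 - eps],                             # exploit-right: almost always right
    ]
    start = [1.0 / 3.0] * 3
    return start, trans, emit


def _log(x):
    return math.log(x) if x > 0 else float("-inf")


def viterbi(choices, start, trans, emit):
    """Most probable state sequence for a list of 0/1 choices (log-space Viterbi)."""
    n = len(start)
    score = [_log(start[s]) + _log(emit[s][choices[0]]) for s in range(n)]
    backptrs = []
    for c in choices[1:]:
        ptr, new_score = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + _log(trans[i][j]))
            ptr.append(best_i)
            new_score.append(score[best_i] + _log(trans[best_i][j]) + _log(emit[j][c]))
        backptrs.append(ptr)
        score = new_score
    path = [max(range(n), key=lambda s: score[s])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return [STATES[s] for s in reversed(path)]
```

      With decoded labels in hand, per-trial RTs can be averaged within each state and animal before being compared across sexes.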

      That is a very interesting idea. We tested this possibility by running a two-way ANOVA (with interaction) of explore-exploit state and sex on RT. There was a significant main effect of state (RT was longer in the explore state than in the exploit state; main effect of state: F(1,30) = 13.07, p = 0.0011), but males were slower than females during both exploitation and exploration (main effect of sex, F(1,30) = 14.15, p = 0.0007), and there was no significant interaction (F(1,30) = 0.279, p = 0.60). Unfortunately, this means that we cannot interpret the response time difference between males and females as a consequence of the greater male tendency to explore. Response time is a fairly noisy primary behavioral metric, especially in males, and many other factors might be at play here, some of which we plan to follow up on in the future. We report this result as follows (page 10, lines 248-254):

      “Since males had more exploratory trials, which took longer, we tested the possibility that the sex difference in response time was due to prolonged exploration in males by running a two-way ANOVA of explore-exploit state and sex on response time. There was a significant main effect of state (F(1,30) = 13.07, p = 0.0011), but males were slower than females during both exploitation and exploration (main effect of sex, F(1,30) = 14.15, p = 0.0007), and there was no significant interaction (F(1,30) = 0.279, p = 0.60).”
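      The F(1,30) degrees of freedom reported for state, sex and their interaction match a split-plot (mixed) design in which explore-exploit state is a within-subject factor and sex a between-subject factor: with two levels of each, every error term then has N - 2 degrees of freedom. Assuming that design (the text itself only says "two-way ANOVA") and equal group sizes, a minimal NumPy sketch of the sums of squares and F ratios is:

```python
import numpy as np

# Sketch only: assumes a balanced mixed design (state within-subject, sex between-subject).


def mixed_anova(y, group):
    """Split-plot ANOVA with one between-subject and one within-subject factor.

    y: array (n_subjects, n_states), e.g. mean RT per animal in each state.
    group: between-subject label per row (e.g. sex). Equal group sizes assumed.
    Returns (sums of squares, degrees of freedom, F ratios).
    """
    y = np.asarray(y, dtype=float)
    group = np.asarray(group)
    levels = np.unique(group)
    a, b = len(levels), y.shape[1]
    n = y.shape[0] // a
    if n * a != y.shape[0] or any((group == g).sum() != n for g in levels):
        raise ValueError("this sketch assumes equal group sizes")
    gi = np.searchsorted(levels, group)                            # group index per subject
    grand = y.mean()
    subj = y.mean(axis=1)                                          # subject means
    grp = np.array([y[group == g].mean() for g in levels])         # group means
    state = y.mean(axis=0)                                         # state means
    cell = np.array([y[group == g].mean(axis=0) for g in levels])  # a x b cell means
    ss = {
        "group": n * b * np.sum((grp - grand) ** 2),
        "subj_within_group": b * np.sum((subj - grp[gi]) ** 2),
        "state": a * n * np.sum((state - grand) ** 2),
        "interaction": n * np.sum((cell - grp[:, None] - state[None, :] + grand) ** 2),
        "error_within": np.sum((y - cell[gi] - subj[:, None] + grp[gi][:, None]) ** 2),
    }
    df = {
        "group": a - 1,
        "subj_within_group": a * (n - 1),
        "state": b - 1,
        "interaction": (a - 1) * (b - 1),
        "error_within": a * (n - 1) * (b - 1),
    }
    ms = {k: ss[k] / df[k] for k in ss}
    f = {
        "group": ms["group"] / ms["subj_within_group"],   # between-subject test
        "state": ms["state"] / ms["error_within"],         # within-subject tests
        "interaction": ms["interaction"] / ms["error_within"],
    }
    return ss, df, f
```

      For example, two groups of 16 animals would give 30 error degrees of freedom for every test; p-values can then be read from the F distribution, e.g. `scipy.stats.f.sf(F, df_effect, df_error)`.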

      Reviewer #3:

      In the manuscript 'Sex differences in learning from exploration', Chen and colleagues investigated sex differences in decision-making behavior during a two-armed spatial restless bandit task. Sex differences and dysregulated exploration have been observed in various neuropsychiatric disorders. Yet it has been unclear whether sex differences in exploration and exploitation contribute to sex-linked vulnerabilities in neuropsychiatric disorders.

      Chen and colleagues applied comprehensive modeling (a model-free hidden Markov model (HMM) and various reinforcement learning (RL) models) and behavioral analysis (analysis of choice behavior using the latent variables extracted from the HMM) to answer this question. They found that male mice explored more than female mice and were more likely to spend an extended period of time exploring before committing to a favored choice. In contrast, female mice were more likely to show elevated learning during the exploratory period, making exploration more efficient and allowing them to start exploiting a favored choice earlier.

      Overall, I find the question studied in this work interesting and compelling. The results were also convincing and the analysis thorough. However, the assumptions in the proposed HMM are not fully justified, and additional analyses are needed to strengthen the authors' claims. Specifically, the effect of obtained reward on state transitions, and biased exploitation, should be further explored.

      Thank you for your feedback. We have included two more complex versions of the hidden Markov model (HMM) that account for the effect of obtained reward on state transitions and for biased exploitation. Although the additional parameters slightly improved the model fit, model comparison tests suggested that this improvement was not significant. We decided to keep the HMM from the original manuscript because it is the simplest, best-fitting model and provides the best parameter estimation given the amount of data we have. We appreciate the comments and believe that the inclusion of two new HMMs and the justification of the original HMM have strengthened our claims.
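      The model comparison mentioned in this reply can be illustrated with a likelihood-ratio test and BIC. The specific tests the authors used are not named here, so this is an assumed example: it compares the maximised log-likelihoods of a simpler HMM nested inside a richer one, against a chi-square reference distribution whose survival function is written out in closed form for integer degrees of freedom, so the block needs only the standard library. The chi-square approximation can be unreliable when extra parameters sit on a boundary of their range, which is one reason to also report an information criterion.

```python
import math

# Sketch only: an assumed model-comparison procedure, not necessarily the authors'.


def chi2_sf(x, k):
    """Survival function P(X > x) of a chi-square variable with integer df k >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if k % 2 == 0:
        # Q(m, y) = exp(-y) * sum_{i<m} y^i / i!
        return math.exp(-y) * sum(y ** i / math.factorial(i) for i in range(k // 2))
    # Q(m + 1/2, y) = erfc(sqrt(y)) + exp(-y) * sum_{i<m} y^(i + 1/2) / Gamma(i + 3/2)
    m = (k - 1) // 2
    return math.erfc(math.sqrt(y)) + math.exp(-y) * sum(
        y ** (i + 0.5) / math.gamma(i + 1.5) for i in range(m))


def likelihood_ratio_test(ll_reduced, ll_full, extra_params):
    """LR statistic and p-value for a nested model with extra_params more parameters."""
    stat = 2.0 * (ll_full - ll_reduced)
    return stat, chi2_sf(stat, extra_params)


def bic(ll, n_params, n_obs):
    """Bayesian information criterion; lower is better."""
    return n_params * math.log(n_obs) - 2.0 * ll
```

      With made-up log-likelihoods of -1000.0 (reduced) and -998.5 (full, two extra parameters), `likelihood_ratio_test(-1000.0, -998.5, 2)` gives a statistic of 3.0 and p = exp(-1.5) ≈ 0.22, i.e. a slight but non-significant improvement.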

    1. Author Response

      Reviewer #2 (Public Review):

      I believe the authors succeeded in finding neural evidence of reactivation during REM sleep. This is their main claim, and I applaud them for that. I also applaud their efforts to explore their data beyond this claim, and I think they included appropriate controls in their experimental design. However, I found other aspects of the paper to be unclear or lacking in support. I include major and medium-level comments:

      Major comments, grouped by theme with specifics below:

      Theta.

      Overall assessment: the theta effects are either over-emphasized or unclear. Please either remove the high/low theta effects or provide a better justification for why they are insightful.

      Lines ~ 115-121: Please include the statistics for low-theta power trials. Also, without a significant difference between high- and low-theta power trials, it is unclear why this analysis is being featured. Does theta actually matter for classification accuracy?

      Lines 123-128: What ARE the important bands for classification? I understand the point about it overlapping in time with the classification window without being discriminative between the conditions, but it still is not clear why theta is being featured given the non-significant differences between high/low theta and the lack of its involvement in classification. REM sleep is high in theta, but other than that, I do not understand the focus given this lack of empirical support for its relevance.

      Lines 232-233: "In our data, trials with higher theta power show greater evidence of memory reactivation." Please do not use this language without a difference between high and low theta trials. You can say there was significance using high theta power and not with low theta power, but without the contrast, you cannot say this.

      Thank you, we have taken this point onboard. We thought the differences observed between classification in high and low theta power trials were interesting, but we can see why the reviewer feels there is a need for a stronger hypothesis here before reporting them. We have therefore removed this approach from the manuscript, and no longer split trials into high and low theta power.

      Physiology / Figure 2.

      Overall assessment: It would be helpful to include more physiological data.

      It would be nice, either in Figure 2 or in the supplement, to see the raw EEG traces in these conditions. These would be especially instructive because, with NREM TMR, the ERPs seem to take a stereotypical pattern that begins with a clear influence of slow oscillations (e.g., in Cairney et al., 2018), and it would be helpful to show the contrast here in REM.

      We thank the reviewer for these comments. We have now performed ERP and time-frequency analyses following an approach similar to that of Cairney et al. (2018). We have added a section to the results for these analyses as follows:

      “Elicited response pattern after TMR cues

      We looked at the TMR-elicited response in both time-frequency and ERP analyses using a method similar to that of Cairney et al. (2018); see Methods. As shown in Figure 2a, the EEG response showed a rapid increase in the theta band, followed by an increase in the beta band starting about one second after TMR onset. REM sleep is dominated by theta activity, which is thought to support the consolidation process (Diekelmann & Born, 2010), and increased theta power has previously been shown to occur after successful cueing during sleep (Schreiner & Rasch, 2015). We therefore analysed the TMR-elicited theta in more detail. Focussing on the first second after TMR onset, we found that theta power was significantly higher than in the pre-cue baseline period [-300, -100] ms for both adaptation (Wilcoxon signed-rank test, n = 14, p < 0.001) and experimental nights (Wilcoxon signed-rank test, n = 14, p < 0.001). The absence of any difference in theta power between experimental and adaptation conditions (Wilcoxon signed-rank test, n = 14, p = 0.68) suggests that this response is related to processing of the sound cue itself, not to memory reactivation. Turning to the ERP analysis, we found a small increase in ERP amplitude immediately after TMR onset, followed by a decrease in amplitude 500 ms after the cue. Comparison of ERPs from experimental and adaptation nights showed no significant difference (n = 14, p > 0.1). As with the time-frequency result, this suggests that the ERPs observed here relate to processing of the sound cues rather than of any associated memory.”
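      The comparison of theta power in the first second after the cue against the [-300, -100] ms baseline can be sketched with complex Morlet wavelets. This is an assumed, simplified single-channel pipeline, not the authors' Methods: the number of wavelet cycles, the integer 4-8 Hz frequency grid and the function names are illustrative choices.

```python
import numpy as np

# Sketch only: wavelet settings and band limits are illustrative assumptions.


def morlet_power(signal, fs, freqs, n_cycles=5.0):
    """Power over time at each frequency via convolution with complex Morlet wavelets."""
    signal = np.asarray(signal, dtype=float)
    power = np.empty((len(freqs), signal.size))
    for k, f in enumerate(freqs):
        sigma_t = n_cycles / (2 * np.pi * f)                  # Gaussian width in s
        t = np.arange(-3 * sigma_t, 3 * sigma_t, 1 / fs)
        wavelet = np.exp(2j * np.pi * f * t) * np.exp(-t ** 2 / (2 * sigma_t ** 2))
        wavelet /= np.abs(wavelet).sum()                      # unit gain at frequency f
        if wavelet.size > signal.size:
            raise ValueError("epoch is shorter than the wavelet")
        power[k] = np.abs(np.convolve(signal, wavelet, mode="same")) ** 2
    return power


def theta_post_vs_baseline(epoch, fs, t0, band=(4, 8),
                           baseline=(-0.3, -0.1), window=(0.0, 1.0)):
    """Mean theta power in the post-cue window and in the pre-cue baseline.

    t0 is the time (s, relative to cue onset) of the first sample in the epoch.
    """
    freqs = np.arange(band[0], band[1] + 1)
    theta = morlet_power(epoch, fs, freqs).mean(axis=0)       # average over the band
    times = t0 + np.arange(len(epoch)) / fs
    base = theta[(times >= baseline[0]) & (times < baseline[1])].mean()
    post = theta[(times >= window[0]) & (times < window[1])].mean()
    return post, base
```

      Per-participant averages of `post` and `base` could then be compared across participants with a Wilcoxon signed-rank test, for example `scipy.stats.wilcoxon`. Epochs should extend well beyond both windows so that wavelet edge effects do not leak into them.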

      And we have updated Figure 2.

      Also, please expand the classification window beyond 1 s for wake and 1.4 s for sleep. It seems the wake axis stops at 1 s and it would be instructive to know how long that lasts beyond 1 s. The sleep signal should also go longer. I suggest plotting it for at least 5 seconds, considering prior investigations (Cairney et al., 2018; Schreiner et al., 2018; Wang et al., 2019) found evidence of reactivation lasting beyond 1.4 s.

      Regarding the classification window, this is an interesting point. TMR cues in sleep were spaced 1.5 s apart, which is why we included only this window in our classification. Extending our window beyond 1.5 s would mean including the time when the next TMR cue was presented. Similarly, in wake, trials lasted 1.1 s, so the next tone was presented at 1.1 s.

      Following the reviewer’s comment, we have extended our window as requested, even though this means encroaching on the next trial. We did this because there could be a transitional period between trials. When we extended the window in wake and looked at reactivation in the range 0.5 s to 1.6 s, we found that the effect continued to ~1.2 s against both adaptation and chance, i.e. it continued for 100 ms after the end of the trial. Results are shown in the figures below.

      Temporal compression/dilation.

      Overall assessment: This could be cut from the paper. If the authors disagree, I am curious how they think it adds novel insight.

      Line 179 section: In my opinion, this does not show evidence for compression or dilation. If anything, it argues that reactivation unfolds on a similar scale, as the numbers are clustered around 1. I suggest the authors scrap this analysis, as I do not believe it supports any main point of their paper. If they do decide to keep it, they should expand the window of dilation beyond 1.4 in Figure 3B (why cut off the graph at a data point that is still significant?). And they should later emphasize that the main conclusion, if any, is that the scales are similar.

      Line 207 section on the temporal structure of reactivation, 1st paragraph: Once again, in my opinion, this whole concept is not worth mentioning here, as there is not really any relevant data in the paper that speaks to this concept.

      We thank the reviewer for these frank comments. On consideration, we have now removed the compression/dilation analysis.

      Behavioral effects.

      Overall assessment: Please provide additional analyses and discussion.

      Lines 171-178: Nice correlation! Was there any correlation between reactivation evidence and pre-sleep performance? If so, could the authors show those data, and also test whether this relationship holds while covarying out pre-sleep performance? The logic is that intact reactivation may rely on intact pre-sleep performance; conversely, there could be an inverse relationship if sleep reactivation is greater for initially weaker traces, as some have argued (e.g., Schapiro et al., 2018). This analysis will either strengthen their conclusion or change it -- either outcome is good.

      Thanks for these interesting points. We have now performed a new analysis to check if there was a correlation between classification performance and pre-sleep performance, but we found no significant correlation (n = 14, r = -0.39, p = 0.17). We have included this in the results section as follows:

      “Finally, we wanted to know whether the extent to which participants learned the sequence during training might predict the extent to which we could identify reactivation during subsequent sleep. We therefore tested for a correlation between classification performance and pre-sleep performance; this showed no significant correlation (n = 14, r = -0.39, p = 0.17).”

      Note that, for both the cued and un-cued sequences, we calculated the behavioural improvement by subtracting pre-sleep performance from post-sleep performance and then normalising by pre-sleep performance, as follows:

      [(random blocks after sleep - the best 4 blocks after sleep) – (random blocks pre-sleep – the best 4 blocks pre-sleep)] / (random blocks pre-sleep – the best 4 blocks pre-sleep).
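      The normalised improvement formula above translates directly into a small helper. The argument names are hypothetical, and treating each argument as a block-mean reaction time (so that random-minus-best-4 differences index sequence-specific skill) is an assumption; the text does not state the units.

```python
# Sketch only: argument names are hypothetical; inputs assumed to be mean RTs.


def normalised_improvement(random_pre, best4_pre, random_post, best4_post):
    """[(post skill) - (pre skill)] / (pre skill), with skill = random - best 4 blocks."""
    pre_skill = random_pre - best4_pre
    post_skill = random_post - best4_post
    if pre_skill == 0:
        raise ValueError("pre-sleep skill is zero, so the ratio is undefined")
    return (post_skill - pre_skill) / pre_skill
```

      Applying it separately to the cued and un-cued sequences gives one improvement score per sequence for each participant.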

      Unlike Schönauer et al. (2017), they found a strong correspondence between REM reactivation and memory improvement across sleep; however, there was no benefit of TMR cues overall. These two results in tandem are puzzling. Could the authors discuss this more? What does it mean to have the correlation without the overall effect? Or else, is there anything else that may drive the individual differences they allude to in the Discussion?

      We have now added a discussion of this point as follows:

      “We are at a very early phase in understanding what TMR does in REM sleep; however, we do know that the connection between the hippocampus and neocortex is inhibited by the high levels of acetylcholine that are present in REM (Hasselmo, 1999). This means that the reactivation we observe in the cortex is unlikely to be linked to corresponding hippocampal reactivation, so any consolidation that occurs as a result is also unlikely to be linked to the hippocampus. The SRTT is a sequencing task that relies heavily on the hippocampus, and our primary behavioural measure (Sequence Specific Skill) specifically examines the sequencing element of the task. Our own neuroimaging work has shown that TMR in non-REM sleep leads to extensive plasticity in the medial temporal lobe (Cousins et al., 2016). However, if TMR in REM sleep has no impact on the hippocampus, then it is quite possible that it elicits cortical reactivation and leads to cortical plasticity but provides no measurable benefit to Sequence Specific Skill. Alternatively, because we only measured behavioural improvement right after sleep, we may have missed behavioural improvements that would have emerged several days later, as is known to occur in this task (Rakowska et al., 2021).”

      Medium-level comments

      Lines 63-65: "We used two sequences and replayed only one of them in sleep. For control, we also included an adaptation night in which participants slept in the lab, and the same tones that would later be played during the experimental night were played."

      I believe the authors could make a stronger point here: their design allowed them to show that they are not simply decoding SOUNDS but actual memories. The null finding on the adaptation night is definitely helpful in ruling this possibility out.

      We agree and would like to thank the reviewer for this point. We have now included this in the text as follows: “This provided an important control, as a null finding from this adaptation night would ensure that we are decoding actual memories, not just sounds.”

      Lines 129-141: Does reactivation evidence go down (like in their prior study, Belal et al., 2018)? All they report is theta activity rather than classification evidence. Also, I am unclear why the Wilcoxon comparison was performed rather than a simple correlation in theta activity across TMR cues (though again, it makes more sense to me to investigate reactivation evidence across TMR cues instead).

      Thanks a lot for this interesting point. In our prior study (Belal et al., 2018), the classification model was trained on wake data and then tested on sleep data, which enabled us to examine its performance at different timepoints in sleep. However, in the current study the classifier was trained on sleep and tested on wake, so we can only test for differential replay at different times during the night by dividing the training data. We fear that dividing sleep trials into smaller blocks in this way will lead to weakly trained classifiers with inaccurate weight estimates due to the small number of training trials, and that these will not generalise to the testing data. Nevertheless, following your comment, we tried this by dividing our sleep trials into two blocks, i.e. the first half and the second half of stimulation during the night. When we ran the analysis on these blocks separately, no clusters were found for either the first or second half of stimulation compared to adaptation, probably for the reasons cited above. Hence, the differences in design between the two studies mean that the current study does not lend itself to this analysis.

      Line 201: It seems unclear whether they should call this "wake-like activity" when the classifier involved training on sleep first and then showing it could decode wake rather than vice versa. I agree with the authors' logic that wake signals that are specific to wake will be unhelpful during sleep, but I am not sure "wake-like" fits here. I'm not going to belabor this point, but I do encourage the authors to think deeply about whether this is truly the term that fits.

      We agree that a better terminology is needed, and have now changed this: “In this paper we demonstrated that memory reactivation after TMR cues in human REM sleep can be decoded using EEG classifiers. Such reactivation appears to be most prominent about one second after the sound cue onset. ”

      Reviewer #3 (Public Review):

      The authors investigated whether reactivation of wake EEG patterns associated with left- and right-hand motor responses occurs in response to sound cues presented during REM sleep.

      The question of whether reactivation occurs during REM is of substantial practical and theoretical importance. While some rodent studies have found reactivation during REM, it has generally been more difficult to observe reactivation during REM than during NREM sleep in humans (with a few notable exceptions, e.g., Schonauer et al., 2017), and the nature and function of memory reactivation in REM sleep is much less well understood than the nature and function of reactivation in NREM sleep. Finding a procedure that yields clear reactivation in REM in response to sound cues would give researchers a new tool to explore these crucial questions.

      The main strength of the paper is that the core reactivation finding appears to be sound. This is an important contribution to the literature, for the reasons noted above.

      The main weakness of the paper is that the ancillary claims (about the nature of reactivation) may not be supported by the data.

      The claim that reactivation was mediated by high theta activity requires a significant difference in reactivation between trials with high theta power and trials with low theta, but this is not what the authors found (rather, they have a "difference of significances", where results were significant for high theta but not low theta). So, at present, the claim that theta activity is relevant is not adequately supported by the data.

      The authors claim that sleep replay was sometimes temporally compressed and sometimes dilated compared to wakeful experience, but I am not sure that the data show compression and dilation. Part of the issue is that the methods are not clear. For the compression/dilation analysis, what are the features that are going into the analysis? Are the feature vectors patterns of power coefficients across electrodes (or within single electrodes?) at a single time point? or raw data from multiple electrodes at a single time point? If the feature vectors are patterns of activity at a single time point, then I don't think it's possible to conclude anything about compression/dilation in time (in this case, the observed results could simply reflect autocorrelation in the time-point-specific feature vectors - if you have a pattern that is relatively stationary in time, then compressing or dilating it in the time dimension won't change it much). If the feature vectors are spatiotemporal patterns (i.e., the patterns being fed into the classifier reflect samples from multiple frequencies/electrodes / AND time points) then it might in principle be possible to look at compression, but here I just could not figure out what is going on.

      Thank you. We have removed the analysis of temporal compression and dilation from the manuscript. However, we would still like to answer the question. In this analysis, raw data were smoothed and used as time-domain features. The data were organized as trials x channels x timepoints, and each trial was then segmented in time according to the compression factor being tested. For instance, to test whether sleep is 2x faster than wake, we took the wake trial length (1.1 s) and halved it (0.55 s). We then took windows of this length from the sleep data, so that each sleep trial yielded multiple smaller segments of 0.55 s; these segments were added as new trials and given the label of the trial they came from. Next, we resized these segments temporally to match the length of the wake trials. We then reshaped the data from trials x channels x timepoints to trials x channels_timepoints, aggregating channels and timepoints into one dimension, and used PCA to reduce this dimension to a set of principal components. The resulting features were fed to an LDA classifier. This whole process was repeated for every scaling factor, within participant, in the same fashion as the main classification, and the error bars were standard errors. We compared the results from the experimental night to those of the adaptation night.
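      The compression/dilation pipeline described in this reply can be sketched with NumPy. This is a reconstruction under stated assumptions, not the authors' code: segments are non-overlapping, resizing uses linear interpolation, the smoothing step is omitted, and the classifier is a two-class Fisher LDA with a shared covariance and a midpoint threshold. Training on resized sleep segments and testing on wake trials follows the train-on-sleep, test-on-wake direction described in this response.

```python
import numpy as np

# Sketch only: segmentation, resizing and LDA details are assumptions.


def segment_and_resize(trials, fs, wake_len_s, factor):
    """Cut sleep trials (trials x channels x time) into non-overlapping windows of
    wake_len_s / factor seconds, then stretch each window to wake length."""
    n_trials, n_ch, n_t = trials.shape
    seg = int(round(wake_len_s / factor * fs))
    target = int(round(wake_len_s * fs))
    old, new = np.linspace(0, 1, seg), np.linspace(0, 1, target)
    segments, source = [], []
    for i in range(n_trials):
        for start in range(0, n_t - seg + 1, seg):
            window = trials[i, :, start:start + seg]
            segments.append(np.stack([np.interp(new, old, ch) for ch in window]))
            source.append(i)  # index of the parent trial, used to inherit its label
    return np.array(segments), np.array(source)


def fit_pca_lda(X, y, n_components=10):
    """Flatten to trials x (channels*time), reduce with PCA, fit two-class Fisher LDA."""
    flat = X.reshape(len(X), -1)
    mu = flat.mean(axis=0)
    _, _, vt = np.linalg.svd(flat - mu, full_matrices=False)
    W = vt[:n_components].T                       # principal axes as columns
    Z = (flat - mu) @ W
    m0, m1 = Z[y == 0].mean(axis=0), Z[y == 1].mean(axis=0)
    Sw = np.atleast_2d(np.cov(Z[y == 0], rowvar=False) + np.cov(Z[y == 1], rowvar=False))
    w = np.linalg.solve(Sw + 1e-6 * np.eye(len(Sw)), m1 - m0)  # small ridge for stability
    threshold = w @ (m0 + m1) / 2                 # midpoint: equal priors assumed
    return mu, W, w, threshold


def predict(model, X):
    """Predicted 0/1 labels for trials shaped like the training segments."""
    mu, W, w, threshold = model
    Z = (X.reshape(len(X), -1) - mu) @ W
    return (Z @ w > threshold).astype(int)
```

      Because the flattened features mix channels and timepoints, a pattern that is nearly stationary in time changes little when stretched, which is the autocorrelation concern the reviewer raised about interpreting such scaling results.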

      For the analyses relating to classification performance and behavior, the authors presently show that there is a significant correlation for the cued sequence but not for the other sequence. This is a "difference of significances" but not a significant difference. To justify the claim that the correlation is sequence-specific, the authors would have to run an analysis that directly compares the two sequences.

      Thanks a lot. We have now followed this suggestion by examining the sequence-specific improvement after removing the effect of the un-cued sequence from the cued sequence. This was done by subtracting the improvement for the un-cued sequence from the improvement for the cued sequence and then normalising the result by the improvement for the un-cued sequence. The resulting values, which we term ‘cued sequence improvement’, showed a significant correlation with classification performance (n = 14, r = 0.56, p = 0.04). We have therefore updated the text as follows: “We therefore set out to determine whether there was a relationship between the extent to which we could classify reactivation and overnight improvement on the cued sequence. This revealed a positive correlation (n = 14, r = 0.56, p = 0.04), Figure 3b.”
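      The ‘cued sequence improvement’ measure and its correlation with classification performance can be expressed as follows. The function names are illustrative, and no p-value is computed here; `scipy.stats.pearsonr` returns both r and a p-value if needed.

```python
import math

# Sketch only: function names are illustrative, not from the authors' code.


def cued_sequence_improvement(cued, uncued):
    """(cued improvement - un-cued improvement) / un-cued improvement."""
    return (cued - uncued) / uncued


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

      Because the measure divides by the un-cued improvement, it becomes unstable when that improvement is close to zero or changes sign, so the distribution of the denominator across participants is worth checking before correlating.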

    1. Author response:

      Reviewer #1 (Public Review):

      In this study, Girardello et al. use proteomics to reveal the membrane-tension-sensitive caveolin-1 interactome in migrating cells. The authors use EM and surface rendering to demonstrate that caveolae formed at the rear of migrating cells are complex membrane-linked multilobed structures, and they devise a robust strategy to identify caveolin-1-associated proteins using APEX2-mediated proximity biotinylation. This important dataset is further validated using proximity ligation assays to confirm key interactions, and the authors follow up with an interrogation of a surprising relationship between caveolae and RhoGTPase signalling, in which caveolin-1 recruits ROCK1 under high membrane tension conditions, and ROCK1 activity is required to reform caveolae upon reversion to isotonic solution. However, caveolin-1 recruits the RhoA inactivator ARHGAP29 when membrane tension is low, and ARHGAP29 overexpression leads to disassembly of caveolae and reduced cell motility. This study builds on previous findings linking caveolae to positive feedback regulation of RhoA signalling, and provides further evidence that caveolae serve to drive rear retraction in migration but also possess an intrinsic brake to limit RhoA activation, leading the authors to suggest that cycles of caveolae assembly and disassembly could be central to establishing a stable cell rear for persistent cell migration.

      A major strength of the manuscript is the robust proteomic dataset. The experimental set-up is well defined and mostly well controlled, and there is good internal validation: core caveolar proteins are highly abundant under low membrane tension (isotonic) conditions and absent under high membrane tension (brief hypo-osmotic shock) conditions, which correlates very well with previous findings. The data could, however, be better presented to show where statistically robust changes occur, and the supplementary information should include a table showing abundance. It's very good to see a link to PRIDE, providing a useful resource for the community.

      We thank the reviewer for the positive feedback. We have included the outputs from the search engine in Supplementary File 1.

      The authors detail several known interactions and their mechanosensitivity, but also report new interactors of caveolin-1. Several mechanosensitive interactions of caveolin-1 take place at the cell rear, but others are more diffuse across the cell, judging from the PLA data (e.g. FLN1, CTTN, HSPB1; Figure 4A-F and Figure 4 supplement 1). It is interesting to speculate that those at the cell rear are involved in caveolae, whilst others are linked specifically to caveolin-1 (e.g. dolines). PLA or localisation analysis with Cavin1/PTRF may be able to resolve this and further distinguish caveolar from non-caveolar mechanosensitive interactions.

      We thank the reviewer for this interesting idea. It is true that many if not most proteins we identified to be associated with Cav1 are not restricted to the cell rear. To analyse to what extent the identified proteins interact with Cav1 at the rear we reanalysed our PLA data for some of the antibody combinations we looked at. This new analysis is now shown in Fig 5G. As expected, for Cav1/PTRF and Cav1/EHD2 most PLA dots (70-80%) were found at the rear. This rear bias is also evident from the representative images we show in the Figure panels 5A and 5E. On the contrary, much fewer PLA dots (~40%) were rear-localised for Cav1/CTTN and Cav1/FLNA antibody combinations. This reflects the much broader cellular distribution of these proteins compared to the core caveolae proteins, and might suggest that there are generally few links between caveolae and cortical actin. However, it is also possible that such links/interactions are more difficult to detect using PLA (because of the extended distance between caveolae and the actin cortex, or because of steric constraints).

      The Cav1/ARHGAP29 influence on YAP signalling is interesting, but appears to be quite isolated from the rest of the manuscript. Does overexpression of ARHGAP29 influence YAP signalling and/or caveolar protein expression/Cav1 pY14?

      Our data and published work originally prompted us to speculate that there is a potential functional link between Cav1, YAP, and ARHGAP29. In an attempt to address this we have performed several Western blots on cell lysates from cells overexpressing ARHGAP29. We did not see major changes in Cav1 Y14 phosphorylation levels in cells overexpressing ARHGAP29, and YAP and pYAP levels also remained unchanged (not shown). In addition, based on previous literature 1,2 we expected to see an effect on ARHGAP29 mRNA levels and YAP target gene transcripts in Cav1 siRNA transfected cells. To our surprise, the mRNA levels of three independent YAP target genes and ARHGAP29 were unchanged in Cav1 siRNA treated cells (this is now shown in Figure 6 Figure Supplement 1). Our data therefore suggest that in RPE1 cells, the connection between Cav1 and ARHGAP29 is independent of YAP signalling, and that the increase in ARHGAP29 protein levels observed in Cav1 siRNA cells is due to some unknown post-translational mechanism.

      ARHGAP29 and RhoA/ROCK1 related observations are very interesting and potentially really important. However, the link between ARHGAP29 and caveolae is not well established (other than in proteomic data). PLA or FRET could help establish this.

      We agree that the physical and functional link between caveolae (or Cav1) and ARHGAP29 was not well worked out in the original manuscript. To address this, we performed PLA assays in GFP-ARHGAP29-transfected cells (as we did not find an ARHGAP29 antibody that works reliably in IF) using anti-Cav1 and anti-GFP antibodies. The PLA signal we obtained for Cav1 and ARHGAP29 was not significantly different from that of control PLA experiments, and there was very little PLA signal to start with. This is not surprising given that ARHGAP29 localisation is mostly diffuse in the cytoplasm, whilst Cav1 is concentrated at the rear. In addition, in cases where we do see ARHGAP29 at the cell cortex, Cav1 tends to be absent (this is now shown in Figure 6 – Figure Supplement 2E). In other words, with the tools we have available, we see little colocalisation between Cav1 and ARHGAP29 at steady state. Altogether, we speculate that ARHGAP29, through its negative effect on RhoA, flattens caveolae at the membrane or interferes with caveolae assembly at these sites.

      This of course raises the question of why ARHGAP29 was identified in the Cav1 proteome with such specificity and reproducibility in the first place. This can be explained by the way APEX2 labelling works. Proximity biotinylation with APEX2 is extremely sensitive and restricted to a labelling radius of ~20 nm 3. The labelling reaction is conducted on live, intact cells at room temperature for 1 min. Although 1 min appears short, dynamic cellular processes occur on a time scale of seconds and are ongoing during the labelling reaction. It is conceivable that within this 1 min time frame, ARHGAP29 cycles on and off the rear membrane (kiss and run). This would allow ARHGAP29 to be biotinylated by Cav1-APEX2, resulting in its identification by MS. We have included this in the discussion section.

      The relationship between ARHGAP29 and RhoA signalling is not well defined. Is GAP activity important in determining the effect on migration and caveolae formation? What is the effect on RhoA activity? Alternatively, the authors could investigate YAP dependent transcriptional regulation downstream of overexpression.

      We have addressed this point using overexpression and siRNA transfections. We overexpressed ARHGAP29 or ARHGAP29 lacking its GAP domain and performed WB analysis against pMLC (a commonly used and reliable readout for RhoA and myosin-II activity). Much to our surprise, overexpression of ARHGAP29 increased (rather than decreased) pMLC levels, partially in a GAP-dependent manner (see Author response image 1). This is puzzling, as ARHGAP29 is expected to reduce RhoA-GTP levels, which in turn should reduce ROCK activity and hence pMLC levels. In addition, and also surprisingly, siRNA-mediated silencing of ARHGAP29 did not significantly change pMLC levels. By contrast, pMLC levels were strongly reduced in Cav1 siRNA-treated cells (this is shown in Fig. 6A and 6B in the revised manuscript). These new data underscore the important role of caveolae in the control of myosin-II activity, but do not allow us to draw any firm conclusions about the role of ARHGAP29 at the cell rear.

      Author response image 1.

      Overexpression of ARHGAP29 increases, rather than reduces, pMLC in RPE1 cells.

      We are uncertain how to interpret the ARHGAP29 overexpression data presented in Author response image 1 and therefore decided not to include them in the manuscript. One possibility is that inactivation of RhoA below a critical threshold causes other mechanisms to compensate. For instance, the activity of alternative MLC kinases such as MLCK could be enhanced under these conditions. Another possibility is that ARHGAP29 controls MLC phosphorylation indirectly. For instance, it has been shown that ARHGAP29 promotes actin destabilisation by inactivating LIMK/cofilin signalling 1. In agreement with this, we find that overexpression of ARHGAP29 reduces p-cofilin (serine 3) levels (see Author response image 2). Since cofilin and MLC signalling cross-talk 4, it is possible that the increased pMLC levels are the result of a feedback loop that compensates for actin depolymerisation. This is now covered in the discussion section. Whatever the case, we hope the reviewers understand that deeper mechanistic insight into the intricate mechanisms of Rho signalling at the cell rear is beyond the scope of this manuscript.

      Author response image 2.

      Overexpression of ARHGAP29 reduces p-cofilin levels in RPE1 cells.

      Reviewer #2 (Public Review):

      Girardello et al. investigated the composition of the molecular machinery of caveolae governing their mechano-regulation in migrating cells. Using live cell imaging and RPE1 cells, the authors provide a spatio-temporal analysis of cavin-3 distribution during cell migration and reveal that caveolae are preferentially and stably localized at the rear of the cell. They further characterize these structures using electron tomography and reveal an organization into clusters connected to the cell surface. Using a proteomic approach, they address the interactome of caveolin-1 upon mechanical stimulation, comparing RPE1 cells exposed to hypo-osmotic shock (which aims to increase cell membrane tension) with unstimulated control cells. The authors identify over 300 proteins, notably proteins related to the actin cytoskeleton and cell adhesion. These results were further validated in cellulo by interrogating protein-protein interactions using proximity ligation assays and hypo-osmotic shock. These experiments confirmed previous data showing that high membrane tension induces caveolae disassembly in a reversible manner. Finally, based on the literature and on the results of the proteomic analysis, the authors investigated more deeply the molecular signaling pathway controlling caveolae assembly upon mechanical stimuli. First, they confirm the association of ROCK1 with caveolin-1 and the requirement for its kinase activity in caveolae formation (at the rear of the cell). Then, they show that the RhoGAP ARHGAP29, a factor newly identified by the proteomic analysis, is also implicated in caveolae mechano-regulation, likely through YAP, and find that overexpression of ARHGAP29 affects cell motility. Overall, this paper interrogates the role of membrane tension in caveolae located at the rear of the cell and identifies a new pathway controlling cell motility.

      Strengths:

      Using a proximity-based proteomic assay, the authors reveal the protein network interacting with caveolae upon mechanical stimuli. This approach is elegant and allows the identification of a substantial new set of factors involved in the mechano-regulation of caveolin-1, some of which have been verified directly in cells by PLA. This study provides a compelling set of data on the interactions between caveolae and their cortical network, which was so far ill-characterized.

      We thank the reviewer for this positive feedback.

      Weaknesses:

      The methodology used to demonstrate an impact of membrane tension is not precise enough to assess a direct role on caveolae at a subcellular scale, that is, between the front and the rear of the cell. First, a better characterization of the "front-rear" cellular model is encouraged.

      We agree with the reviewer that a quantitative analysis of caveolae front-rear polarity would strengthen our conclusions. To address this, we have analysed the localisation of Cav1 and cavins in detail and in a large pool of cells, both fixed and live. Our quantification clearly shows that Cav1 and cavins are enriched at the cell rear. This is now shown in Figure 1 and Figure 1 – Figure Supplement 1. To demonstrate that Cav1/cavins are truly rear-localised, we analysed live migrating cells expressing tagged Cav1 or cavins. This analysis, which was performed on several individual time-lapse movies, showed that caveolae rear localisation is remarkably stable (e.g. Figure 1C and 1D). We also present new data panels and movies showing caveolae dynamics during rear retractions, in dividing cells, and in cells that polarise de novo. These new data are now described in the first paragraph of the results section.

      Secondly, the authors frequently present osmotic shock as a "high membrane tension" stimulus. While osmotic shock is widely used in the field, this study is focused only on caveolae localized at the rear of the cell, and it remains unclear how a global mechanical stimulus triggered by an osmotic shock could mimic a local stimulus.

      We agree with the reviewer that osmotic shock causes a global increase in membrane tension and is therefore of only limited value for understanding how membrane tension is regulated at the rear, and how caveolae respond to such a local stimulus. It was neither our aim nor within our expertise to address such questions. Answering them would require sophisticated optogenetic approaches or localised membrane tension measurements (e.g. using the Flipper-TR probe), which are beyond the scope of this manuscript. However, given the strong enrichment of caveolae at the cell rear, we believe it is justified to propose that the changes we observe in the proteome (mostly) reflect changes in caveolae at the rear. We have now included several quantifications of fixed cells, live cells, and PLA assays to support the conclusion that caveolae are highly enriched at the rear. In addition, and importantly, a recent preprint by the Roux lab shows that membrane tension gradients indeed exist in many migrating and non-migrating cells 5. Using very similar hypotonic shock assays, the Caswell lab also showed that low membrane tension at the rear is required for caveolae formation 6. We have included a section in the discussion in which we elaborate on how membrane tension is controlled in migrating cells, and how it might regulate caveolae rear localisation.

      In the present case, it remains unknown to what extent this mechanical stress is a physiologically relevant mimic of the mechanical forces applied at the rear of a migrating cell.

      This is true. Our study does not address the nature of mechanical forces at the cell rear. This is a complex subject that is technically challenging to address, and therefore beyond the scope of this manuscript.

      Some images are not sufficient to fully support the conclusions of the article.

      We agree that some of the images, in particular those presented for the PLA assays, do not always show a clear rear localisation of caveolae. We have explained above why this is the case. We hope that our new quantitative measurements, movies and figure panels address the reviewer’s concern.

      At this stage, an unbiased quantitative spatio-temporal analysis of caveolae upon well-defined mechanical stimuli is also needed.

      These are all very good points that were previously addressed beautifully by the Caswell group 6. To address this in part in our RPE1 cell system, we imaged RPE1 cells exposed to the ROCK inhibitor Y27632 (see Author response image 3). The data show that cell rear retraction is impeded upon ROCK inhibition, which is in line with several previous reports. Cavin-1 remained mostly associated with the cell rear, although its distribution appeared more diffuse. We believe these data do not add much new insight into how caveolae function at the rear, and hence did not include them in the manuscript.

      Author response image 3.

      Effect of ROCK inhibition on cavin1 rear localisation and rear retraction. Cells were imaged one hour after the addition of Y27632.

      Cells in the images, in particular in Figure 1, are difficult to see. Differences in signal-to-noise ratio between cell areas could introduce a bias. Since caveolae density and localization are inconsistent between Figures, more solid illustrations are needed along with quantitative analysis.

      As mentioned above, we have carefully analysed the localisation of caveolae in fixed cells (using Cav1 and cavin1 antibodies as well as Cav1 and cavin fusion proteins) and in live cells transfected with various caveolae proteins. The analysis clearly demonstrates an enrichment of caveolae at the rear (Figure 1 and Figure 1 – Figure Supplement 1). Our tomography and TEM data support this as well (Figure 2).

      References:

      1. Qiao Y, Chen J, Lim YB, et al. YAP Regulates Actin Dynamics through ARHGAP29 and Promotes Metastasis. Cell reports. 2017;19(8):1495-1502.

      2. Rausch V, Bostrom JR, Park J, et al. The Hippo Pathway Regulates Caveolae Expression and Mediates Flow Response via Caveolae. Curr Biol. 2019;29(2):242-255 e246.

      3. Hung V, Udeshi ND, Lam SS, et al. Spatially resolved proteomic mapping in living cells with the engineered peroxidase APEX2. Nat Protoc. 2016;11(3):456-475.

      4. Wiggan O, Shaw AE, DeLuca JG, Bamburg JR. ADF/cofilin regulates actomyosin assembly through competitive inhibition of myosin II binding to F-actin. Dev Cell. 2012;22(3):530-543.

      5. García-Arcos JM, et al. Actin dynamics sustains spatial gradients of membrane tension in adherent cells. bioRxiv. 2024. doi:10.1101/2024.07.15.603517.

      6. Hetmanski JHR, de Belly H, Busnelli I, et al. Membrane Tension Orchestrates Rear Retraction in Matrix-Directed Cell Migration. Dev Cell. 2019;51(4):460-475 e410.

      7. Tsai TY, Collins SR, Chan CK, et al. Efficient Front-Rear Coupling in Neutrophil Chemotaxis by Dynamic Myosin II Localization. Dev Cell. 2019;49(2):189-205 e186.

      8. Mueller J, Szep G, Nemethova M, et al. Load Adaptation of Lamellipodial Actin Networks. Cell. 2017;171(1):188-200 e116.

      9. De Belly H, Yan S, Borja da Rocha H, et al. Cell protrusions and contractions generate long-range membrane tension propagation. Cell. 2023.

      10. Matthaeus C, Sochacki KA, Dickey AM, et al. The molecular organization of differentially curved caveolae indicates bendable structural units at the plasma membrane. Nat Commun. 2022;13(1):7234.

      11. Sinha B, Koster D, Ruez R, et al. Cells respond to mechanical stress by rapid disassembly of caveolae. Cell. 2011;144(3):402-413.

      12. Lieber AD, Schweitzer Y, Kozlov MM, Keren K. Front-to-rear membrane tension gradient in rapidly moving cells. Biophysical journal. 2015;108(7):1599-1603.

      13. Shi Z, Graber ZT, Baumgart T, Stone HA, Cohen AE. Cell Membranes Resist Flow. Cell. 2018;175(7):1769-1779 e1713.

      14. Grande-Garcia A, Echarri A, de Rooij J, et al. Caveolin-1 regulates cell polarization and directional migration through Src kinase and Rho GTPases. The Journal of cell biology. 2007;177(4):683-694.

      15. Grande-Garcia A, del Pozo MA. Caveolin-1 in cell polarization and directional migration. Eur J Cell Biol. 2008;87(8-9):641-647.

      16. Ludwig A, Howard G, Mendoza-Topaz C, et al. Molecular composition and ultrastructure of the caveolar coat complex. PLoS biology. 2013;11(8):e1001640.

    1. Author Response

      Reviewer #1 (Public Review):

      The study presented by AL Seufert et al. follows the trajectory of trained immunity research in the context of sterile inflammatory diseases such as gout, cardiovascular disease and obesity. Previous studies in mice have shown that a 4-week Western-type diet is sufficient to induce systemic trained immunity, with gross reorganization of the bone marrow to support a potentiated inflammatory response [PMID: 29328911]. The current study demonstrates that feeding mice a Western-type diet (WD) or the more extreme Ketogenic diet (KD; where carbohydrates are essentially eliminated from the diet) for 2 weeks results in a state of increased monocyte-driven immune responsiveness compared with standard chow (SC). This increased immune responsiveness after high-fat diet resulted in a deadly hyper-inflammatory response in the mice upon endotoxin (LPS) challenge in vivo.

      These initial findings, as displayed in Figure 1, are difficult to interpret because the authors use a mix of male and female mice coupled with very small sample sizes (n = 5 - 9). Male and female mice are known to have dimorphic responses to LPS exposure in vivo, with males having elevated cytokine levels (TNF, IL-6, IL-1β, and, interestingly, IL-10) and increased rates of severe outcomes after LPS challenge [PMID: 27631979]. As a reader, it is impossible to discern from the methodological description the proportion of each sex in each group, and therefore to determine whether the data are skewed by sexually dimorphic responses to LPS rather than by diet. Additionally, because of the very small sample sizes, the authors cannot perform an analysis stratified by sex to determine whether the effects of the diets on LPS-induced inflammation differ between sexes.

      The Reviewer raises an important point: all studies with endotoxemia in wild-type conventional mice were carried out in 6–8-week-old female BALB/c mice, as mentioned in the Methods under “Ethical approval of animal studies” and “Endotoxin-induced model of sepsis”. It is extremely important to state this more clearly in the results text because, as Reviewer 1 correctly notes, sexual dimorphism and age differences can have very large effects on the outcome of LPS treatment. This was not stated clearly enough in the results, and the age, sex, and background of the mice are now explicitly stated in the Results and Figure Legend sections for each experiment.

      When comparing SC to KD, the authors identify large changes in the distribution of fatty acids circulating in the blood. Most of these were saturated fatty acids (SFA). Although lauric, myristic, and myristovaccenic acid were the most altered after KD, the authors focus their research on the more thoroughly studied palmitic acid (PA).

      We followed up on multiple saturated fatty acids (SFAs; myristic, lauric, and behenic acid) identified in the lipidomic data, and found no robust or repeatable phenotypes in vitro at physiologically relevant concentrations. Our inability to reproduce some of the findings with these SFAs may be due to the instability of some of these fats in solution, and we plan to troubleshoot these assays in order to understand the complexity of SFA-dependent control of inflammation in macrophages. Please see Fig. R1 in this document for data showing LPS-stimulated BMDMs pre-treated with myristic (Fig R1 A-C), lauric (Fig R1 D-F), or behenic (Fig R1 G-I) fatty acids. The physiological concentrations used in these studies were taken from Perreault et al., 2014.

      Figure R1. The effect of myristic acid, lauric acid, and behenic acid on the response to LPS in macrophages. Primary bone marrow-derived macrophages (BMDMs) were isolated from age-matched (6-8 wk) C57BL/6 female and male mice. BMDMs were plated at 1 × 10^6 cells/mL and treated with ethanol vehicle (EtOH; media with 0.05% or 0.35% ethanol to match the MA and LA solutions, respectively), media (Ctrl), LPS (10 ng/mL) for 24 h, or myristic or lauric acid (MA, LA stocks diluted in 0.05% or 0.35% EtOH; conjugated to 2% BSA) for 24 h, with and without a secondary challenge with LPS (10 ng/mL). After the indicated time points, RNA was isolated and expression of (A, B) tnf, (D, E) il-6, and (G, H) il-1β was measured via qRT-PCR. RAW 264.7 macrophages were thawed and cultured for 3-5 days, pelleted, resuspended in DMEM containing 5% FBS and 2% BSA, and treated identically to the BMDMs, with behenic acid (BA stock diluted in 1.7% EtOH) as the primary stimulus. Expression of (C) tnf, (F) il-6, and (I) il-1β was measured via qRT-PCR. For all plates, all treatments were performed in triplicate. For all panels, Student’s t-test was used to assess statistical significance. *p < 0.05; **p < 0.01; ***p < 0.001. Error bars show mean ± SD.

      PA was shown to increase inflammatory cytokine gene expression and protein production of TNF, IL-6 and IL-1β in bone marrow-derived macrophages (BMDMs). The authors tie these effects to ceramide synthesis through pharmacological blockade as well as the use of oleic acid, which allegedly sequesters ceramide synthesis. The authors’ claim that oleic acid supplementation reverses the inflammatory signaling induced by PA is invalid, as oleic acid itself was shown to induce a high level of cytokines in their model. When PA was added along with oleic acid, cytokine levels returned to those produced by BMDMs stimulated with PA alone (see Figure 4 panels D-F).

      This was an unfortunate oversight in our revision of this manuscript: original Figure 5A-C was mislabeled (though colored correctly), and OA-12h → LPS-24h should have been switched with PA-12h → LPS-24h. These data were labeled correctly in the source file Source_data_Fig5, and the labels in Figure 5 of the manuscript have since been corrected. In light of new data collected, the corrected graphs have been split up in the resubmission. Please see Fig 3K-M and Fig 5A-C.

      Finally, the authors test whether injection of PA into mice can recapitulate the systemic inflammatory response seen after WD and KD feeding followed by LPS exposure. They were able to demonstrate that injecting 1 mM PA, waiting 12 h, and then exposing the mice to LPS for 24 h similarly resulted in a hyper-inflammatory state with greater mortality. The reviewer is skeptical that 1 mM PA truly represents post-prandial PA levels, as one would expect to see after a single fatty meal, and whether this injection is generally well tolerated by mice. Looking into the paper by Eguchi et al. that the authors cite to inform their methods, the earlier study continuously infused an emulsified ethyl palmitate solution (containing 600 mM) at a rate of 0.2 µL/min. As far as I can tell, Eguchi et al. only managed to reach a serum PA concentration of 0.5 mM. This is hardly the same as a single i.p. injection of 1 mM PA, which reflects a single bolus of double the serum PA concentration achieved by Eguchi et al.

      The reviewer raises an important point: Eguchi et al. did use infusions. From their data (Fig 1A), we calculated that after i.v. infusion of 600 mM ethyl palmitate (total = 267 µL within 14 h; 0.2 µL/min) there was ~420 µM absolute PA in the blood. They used C57BL/6 mice that weighed 23 g on average. Using these results, we extrapolated that a single 200 µL injection of a 750 mM PA solution in 6–8-week-old female BALB/c mice (~15-18 g) would equate to ~0.5-1 mM PA in the blood. Because total blood PA concentrations vary widely among healthy and unhealthy obese humans (0.3-4.1 mM) (1, 2), we moved forward with these calculations. We thank the reviewer for this advice, and we agree that we had not definitively shown that we were increasing systemic levels of PA. We therefore ran a lipidomic analysis of serum from SC-fed mice treated with Veh or PA for 12 h. We show that a 750 mM i.p. injection of ethyl palmitate raises free PA levels in the serum to 173-425 μM at 2 h post-injection, which is within the reported range for humans on high-fat diets (0.3-4.1 mM). We have added these new data to Fig. S7A of the main manuscript.

      Importantly, the serum PA concentration in PA-treated mice is greater than in Veh-treated mice. However, we believe the value shown underestimates the maximum serum PA level reached after i.p. injection, because free PA is known to be packaged into chylomicrons within enterocytes and to travel through the circulation with a half-life of less than an hour (3, 4). Serum concentrations of free PA are therefore only transiently elevated by i.p. injection, and PA is quickly taken up by adipose tissue, skeletal muscle, heart, and liver. These complex lipid transport processes make it difficult to determine maximum concentrations of free PA in the serum.

      While the details of PA circulation following an i.p. injection are not fully known, we suggest that this method of “force-feeding” resembles dietary intake in that uptake of PA into the circulation occurs within the peritoneal space before it travels to the blood via the thoracic duct and right lymphatic duct (5).

      PA is known to induce inflammation in monocytes and macrophages, so the findings certainly make sense in the context of previously published literature. However, the authors have made some poor methodological decisions in their mouse studies, namely haphazardly switching between groups of young and old mice (4-6 weeks, 8-9 weeks, and 14-23 weeks), using different LPS injection protocols (6, 10, and 50 mg/ml of LPS), and including both sexes. All of these drastically alter the interpretation of the data and prevent solid conclusions from being drawn.

      We appreciate this review and suggest that:

      1) For the LPS models, mice were all female and age-matched at 6-8 weeks. We are aware of sex differences in the endotoxemia model, which is why we specifically use female mice in our studies (6, 7). This is mentioned twice in the Methods, under “Endotoxin-induced model of sepsis” and “Ethical approval of animal studies”. We have added these specifics of our model to all Results and Figure Legend sections for clarification.

      2) For germ-free models, it is notoriously difficult to breed C57BL/6 germ-free mice, and it was inherently difficult to obtain enough mice of the same sex and age to carry out these experiments. However, since we have previously published in this model with mixed sex and age, we were aware that our WD phenotype is robust in these backgrounds (7). Further, we believe that observing our phenotype independently of age or sex in germ-free mice provides further evidence of its strength. It is important to note that we induce endotoxemia in germ-free mice with 50 mg/kg, instead of the 6 mg/kg used in conventional mice, because this is our reported LD50 for mixed-sex germ-free C57BL/6 mice, as we have previously published in detail (7). This difference is due to the absence of the microbiota (8, 9), and germ-free mice also have an immature immune system that correlates with hyporesponsiveness to microbial products (10-12). We agree with the reviewer that the C57BL/6 germ-free mice are considerably older than our conventional 6-8-week-old mice; we therefore confirmed that WD- and KD-fed conventional C57BL/6 female mice aged 20 – 21 weeks still show enhanced disease severity and mortality in an LPS-induced endotoxemia model compared with SC-fed mice (Fig. S1G-H).

      Figure R2. PA treatment enhances survival in both female and male RAG-/- mice. Age-matched (8-9 wk) RAG-/- mice were injected i.v. with ethyl palmitate (PA, 750mM) or vehicle (Veh) solutions 12 h before C. albicans infection. Survival was monitored for 40h post-infection.

      3) In our preliminary results, we stratified survival during C. albicans infection by sex in C57BL/6 mice and found no notable difference between males and females in survival at 40 h after i.p. infection with Candida albicans (Fig R2 A-B). However, the CFU data presented in the manuscript are kidney burdens from female mice, and we do not have fungal burden data for male mice. These are important data to collect for understanding sex differences in PA-dependent enhanced resistance to systemic C. albicans infection. We are currently addressing this question in the lab, as well as elucidating the cell type and mechanism responsible for PA-dependent enhanced fungal resistance.

    1. Author Response

      Reviewer #1 (Public Review):

      Esmaily and colleagues report two experimental studies in which participants make simple perceptual decisions, either in isolation or in the context of a joint decision-making procedure. In this "social" condition, participants are paired with a partner (in fact, a computer), they learn the decision and confidence of the partner after making their own decision, and the joint decision is made on the basis of the most confident decision between the participant and the partner. The authors found that participants' confidence, response times, pupil dilation, and CPP (i.e. the increase of centro-parietal EEG over time during the decision process) are all affected by the overall confidence of the partner, which was manipulated across blocks in the experiments. They describe a computational model in which decisions result from a competition between two accumulators, and in which the confidence of the partner would be an input to the activity of both accumulators. This model qualitatively produced the variation in confidence and RTs across blocks.
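      The joint-decision rule summarised above (the joint choice is whichever of the two choices was made with higher confidence) can be written as a few lines of code. This is a minimal sketch under stated assumptions: the `Report` type, the ±1 choice coding, a shared confidence scale, and tie-breaking in favour of the participant are all hypothetical, since the source does not specify them.

```python
from dataclasses import dataclass


@dataclass
class Report:
    choice: int        # hypothetical coding: -1 = left, +1 = right
    confidence: float  # assumed to be on a scale shared by both agents


def joint_decision(participant: Report, partner: Report) -> int:
    """Return the choice of whichever agent reported higher confidence.

    Tie-breaking is not described in the source; here ties go to the
    participant (an assumption made for illustration only).
    """
    if partner.confidence > participant.confidence:
        return partner.choice
    return participant.choice


# Usage: the partner is more confident, so the partner's choice becomes the joint choice.
print(joint_decision(Report(choice=1, confidence=0.6), Report(choice=-1, confidence=0.8)))  # → -1
```

      Because the rule depends only on relative confidence, a partner who reports high confidence across a block wins most joint decisions, which is one way to see why the partner's overall confidence is a meaningful block-level manipulation.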

      The major strength of this work is that it puts together many ingredients (behavioral data, pupil and EEG signals, computational analysis) to build a picture of how the confidence of a partner, in the context of joint decision-making, would influence our own decision process and confidence evaluations. Many of these effects are well described already in the literature, but putting them all together remains a challenge.

      We are grateful for this positive assessment.

      However, the construction is fragile in many places: the causal links between the different variables are not firmly established, and it is not clear how pupil and EEG signals mediate the effect of the partner's confidence on the participant's behavior.

      We have modified the language of the manuscript to avoid the implication of a causal link.

      Finally, one limitation of this setting is that the situation being studied is very specific, with a joint decision that is not the result of an agreement between partners but the automatic selection of the most confident decision. Thus, whether the phenomenon of confidence matching also occurs outside this very specific setting is unclear.

      We have now acknowledged this caveat in the discussion (lines 485 to 504). The final paragraph of the discussion now reads as follows:

      “Finally, one limitation of our experimental setup is that the situation being studied is confined to the design choices made by the experimenters. These choices were made in order to operationalize the problem of social interaction within the psychophysics laboratory. For example, the joint decisions were not made through verbal agreement (Bahrami et al., 2010, 2012). Instead, following a number of previous works (Bang et al., 2017, 2020), joint decisions were automatically assigned to the most confident choice. In addition, the partner’s confidence and choice were random variables drawn from a distribution prespecified by the experimenter and were therefore, by design, unresponsive to the participant’s behaviour. In this sense, one may argue that the interaction partner’s behaviour was not “natural”, since they did not react to the participant's confidence communications (note, however, that the partner’s confidence and accuracy were not entirely random but were carefully matched to the participant’s behavior prerecorded in the individual session). How much of the findings is specific to this experimental setting, and whether the behavior observed here would transfer to real-life settings, is an open question. For example, it is plausible that participants may show some behavioral reaction to a human partner’s response time variations, since there is some evidence indicating that for binary choices such as those studied here, response times also systematically communicate uncertainty to others (Patel et al., 2012). Future studies could examine the degree to which the results might be paradigm-specific.”

      Reviewer #2 (Public Review):

      This study is impressive in several ways and will be of interest to behavioral and brain scientists working on diverse topics.

      First, from a theoretical point of view, it very convincingly integrates several lines of research (confidence, interpersonal alignment, psychophysical, and neural evidence accumulation) into a mechanistic computational framework that explains the existing data and makes novel predictions that can inspire further research. It is impressive to read that the corresponding model can account for rather non-intuitive findings, such as that information about high confidence from one's collaborators makes people faster but not more accurate in their judgements.

      Second, from a methodological point of view, it combines several sophisticated approaches (psychophysical measurements, psychophysical and neural modelling, electrophysiological and pupil measurements) in a manner that draws on their complementary strengths and that is most compelling (but see further below for some open questions). The appeal of the study in that respect is that it combines these methods in creative ways that allow it to answer its specific questions in a much more convincing manner than if it had used just one of these approaches alone.

      Third, from a computational point of view, it proposes several interesting ways by which biologically realistic models of perceptual decision-making can incorporate socially communicated information about others' confidence, to explain and predict the effects of such interpersonal alignment on behavior, confidence, and neural measurements of the processes related to both. It is nice to see that explicit model comparison favors one of these ways (top-down driving inputs to the competing accumulators) over others that may a priori have seemed more plausible but mechanistically less interesting and impactful (e.g., effects on response boundaries, no-decision times, or evidence accumulation).

      Fourth, the manuscript is very well written and provides just the right amount of theoretical introduction and balanced discussion for the reader to understand the approach, the conclusions, and the strengths and limitations.

      Finally, the manuscript takes open science practices seriously and employed preregistration, a replication sample, and data sharing in line with good scientific practice.

      We are grateful to the reviewer for their positive assessment of our work.

      Having said all these positive things, there are some points where the manuscript is unclear or leaves some open questions. While the conclusions of the manuscript are not overstated, there are ambiguities in the conceptual interpretation, the descriptions of the methods, some of the methodological procedures themselves, and the interpretation of the results, which leave the reader wondering how reliable and trustworthy some of the many findings that together provide this integrated perspective really are.

      We hope that our modifications and revisions in response to the criticisms listed below will be satisfactory. To avoid redundancies, we have combined each numbered comment with the corresponding recommendation for the Authors.

      First, the study employs rather small sample sizes of N=12 and N=15 and some of the effects are rather weak (e.g., the non-significant CPP effects in study 1). This is somewhat ameliorated by the fact that a replication sample was used, but the robustness of the findings and their replicability in larger samples can be questioned.

      Our study brings together questions from two distinct fields of neuroscience: perceptual decision making and social neuroscience. Each of these two fields has its own traditions and practical conventions. Typically, studies in perceptual decision making employ a small number of extensively trained participants (approximately 6 to 10 individuals). Social neuroscience studies, on the other hand, recruit larger samples (often more than 20 participants) without extensive training protocols. We therefore needed to strike a balance in this trade-off between the number of participants and the number of data points (e.g. trials) obtained from each participant. Note, for example, that each of our participants underwent around 4000 training trials. Strikingly, our initial study (N=12) yielded robust results that showed the hypothesized effects nearly completely, supporting the adequacy of our power estimate. However, we decided to replicate the findings because, like the reviewer, we believe in the importance of adequate sampling. We increased our sample size to N=15 participants to enhance the reliability of our findings. We nonetheless acknowledge the limits of generalizing to larger samples; we now discuss this in the revised manuscript and have included a cautionary note regarding further generalization.

      To complement our results and add a measure of their reliability, here we provide the results of a power analysis that we applied to the data from study 1 (i.e. the discovery phase). These results demonstrate that the sample size of study 2 (i.e. replication) was adequate when conditioned on the results from study 1 (see table and graph pasted below). The results showed that N=13 would be an adequate sample size for 80% power for the behavioural and eye-tracking measurements. Power analysis for the EEG measurements indicated that we needed N=17. Taken together, these power analyses indicate that our sample size of N=15 for Study 2 was reasonably justified.

      We have now added a section to the discussion (Lines 790-805) that communicates these issues as follows:

      “Our study brings together questions from two distinct fields of neuroscience: perceptual decision making and social neuroscience. Each of these two fields has its own traditions and practical conventions. Typically, studies in perceptual decision making employ a small number of extensively trained participants (approximately 6 to 10 individuals). Social neuroscience studies, on the other hand, recruit larger samples (often more than 20 participants) without extensive training protocols. We therefore needed to strike a balance in this trade-off between the number of participants and the number of data points (e.g. trials) obtained from each participant. Note, for example, that each of our participants underwent around 4000 training trials. Importantly, our initial study (N=12) yielded robust results that showed the hypothesized effects nearly completely, supporting the adequacy of our power estimate. However, we decided to replicate the findings in a new sample with N=15 participants to enhance the reliability of our findings and examine our hypothesis in a stringent discovery-replication design. In Figure 4-figure supplement 5, we provide the results of a power analysis that we applied to the data from study 1 (i.e. the discovery phase). These results demonstrate that the sample size of study 2 (i.e. replication) was adequate when conditioned on the results from study 1.”

      We conducted Monte Carlo simulations to determine the sample size required to achieve sufficient statistical power (80%) (Szucs & Ioannidis, 2017). In these simulations, we used the data from study 1. For each sample size (N, x-axis), we randomly selected N participants, with replacement, from the 12 participants in study 1. We then applied the same GLMM used in the main text to assess the dependency of EEG signal slopes on social condition (HCA vs LCA). To obtain an accurate estimate, we repeated this random sampling 1000 times for each sample size N, yielding 1000 statistical tests on randomly generated datasets. The proportion of statistically significant tests among these 1000 represents the statistical power (y-axis). We gradually increased the sample size until reaching the 80% power threshold, as illustrated in the figure. The number indicated by the red circle on the x-axis of this graph is the designated sample size.
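      A minimal sketch of the resampling-based power analysis described above may help make the procedure concrete. It is a simplification under stated assumptions, not the authors' code: `effects` is a hypothetical list holding one summary effect per participant (for example, an HCA-minus-LCA slope difference), and a two-sided one-sample z-test on the resampled mean stands in for the GLMM used in the manuscript. A z-test is anti-conservative at very small N, so a t-test or the full GLMM would be preferable in practice. All names and values are illustrative.

```python
import random
import statistics
from statistics import NormalDist


def simulated_power(effects, n, n_sims=1000, alpha=0.05, seed=0):
    """Fraction of n_sims resampled datasets of size n that reach significance.

    `effects` holds one (hypothetical) effect per participant. A two-sided
    one-sample z-test on the mean replaces the GLMM used in the manuscript.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    significant = 0
    for _ in range(n_sims):
        sample = rng.choices(effects, k=n)  # draw n participants with replacement
        mean = statistics.fmean(sample)
        sd = statistics.stdev(sample)
        if sd == 0:
            # Degenerate resample (all values identical): significant iff mean != 0.
            significant += mean != 0
            continue
        z = mean / (sd / n ** 0.5)
        significant += abs(z) > z_crit
    return significant / n_sims


def required_sample_size(effects, target=0.8, n_max=60, **kwargs):
    """Smallest n (from 2 up to n_max) whose simulated power reaches `target`."""
    for n in range(2, n_max + 1):
        if simulated_power(effects, n, **kwargs) >= target:
            return n
    return None


# Hypothetical per-participant effects standing in for 12 discovery-phase participants.
effects_study1 = [0.42, 0.15, 0.61, -0.08, 0.33, 0.27, 0.50, 0.05, 0.38, -0.12, 0.46, 0.21]
print(required_sample_size(effects_study1))
```

      Because each sample size is evaluated with a fixed seed, the estimated power curve is reproducible but still subject to Monte Carlo noise, so inspecting the whole curve is safer than relying only on its first crossing of 80%.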

      Second, the manuscript interprets the effects of low-confidence partners as an impact of the partner's communicated "beliefs about uncertainty". However, it appears that the experimental setup also leads to greater outcome uncertainty (because the trial outcome is determined by the joint performance of both partners, which is normally reduced for low-confidence partners) and response uncertainty (because subjects need to consider not only their own confidence but also how that will impact on the low-confidence partner). While none of these other possible effects is conceptually unrelated to communicated confidence and the basic conclusions of the manuscript are therefore valid, the reader would like to understand to what degree the reported effects relate to slightly different types of uncertainty that can be elicited by communicated low confidence in this setup.

      We appreciate the reviewer’s advice to remain cautious about the possible sources of uncertainty in our experiment. In the Discussion (lines 790-801) we have now added the following paragraph.

      “We have interpreted our findings to indicate that social information, i.e. the partner’s confidence, impacts the participants’ beliefs about uncertainty. It is important to underscore here that, as in real life, there are other sources of uncertainty in our experimental setup that could affect the participants' beliefs. For example, under joint conditions, the group choice is determined through the comparison of the choices and confidences of the partners. As a result, the participant has the more complex task of matching their response not only with their perceptual experience but also coordinating it with the partner to achieve the best possible outcome. For the same reason, there is greater outcome uncertainty under joint vs individual conditions. Of course, these other sources of uncertainty are conceptually related to communicated confidence, but our experimental design aimed to remove them, as much as possible, by comparing the impact of social information under high vs low confidence of the partner.”

      In addition to the above, we would like to clarify one point here with specific respect to the comment. Note that the computer-generated partner’s accuracy was identical under high and low confidence. In addition, our behavioral findings did not show any difference in accuracy between the HCA and LCA conditions. As a consequence, the argument that “the trial outcome is determined by the joint performance of both partners, which is normally reduced for low-confidence partners” is not valid, because the low-confidence partner’s performance is identical to that of the high-confidence partner. It is possible, of course, that we have misunderstood the reviewer’s point here, and we would be happy to discuss this further if necessary.

      Third, the methods used for measurement, signal processing, and statistical inference in the pupil analysis are questionable. For a start, the Methods do not give enough detail about how the stimuli were calibrated in terms of luminance, etc., to make the pupil signals interpretable.

      In Author response image 1, we provide the calibration plot for our eye-tracking setup, describing the relationship between pupil size and display luminance. Luminance of the random dot motion stimuli (i.e., white dots on a black background) was cd/m² and, importantly, identical across the two critical social conditions. We hope that this additional detail satisfies the reviewer’s concern. For the purpose of brevity, we have decided against adding this part to the manuscript and supplementary material.

      Author response image 1.

      Calibration plot for the experimental setup. Average pupil size (arbitrary units from the EyeLink device) is plotted against display luminance. The plot was obtained by presenting the participant with uniform full-screen displays at 10 different luminance levels covering the entire range of monitor RGB values (0 to 255), whose luminance was separately measured with a photometer. Each display lasted 10 seconds. Error bars show the standard deviation across sessions.

      Moreover, while the authors state that the traces were normalized to a value of 0 at the start of the ITI period, the data displayed in Figure 2 do not show this normalization but different non-zero values. Are these data not normalized, or was a different procedure used? Finally, the authors analyze the pupil signal averaged across a wide temporal ITI interval that may contain stimulus-locked responses (there is not enough information in the manuscript to clearly determine which temporal interval was chosen and averaged across, and how it was made sure that this signal was not contaminated by stimulus effects).

      We have now added the following details to the Methods section (lines 1106-1135).

      “In both studies, eye movements were recorded with an EyeLink 1000 (SR Research) device at a sampling rate of 1000 Hz, controlled by a dedicated host PC. The device was set to desktop mode with pupil-corneal reflection, and data from the left eye were recorded. At the beginning of each block, the system was recalibrated and then validated with a 9-point scheme presented on the screen. For one subject, a 3-point scheme was used because of repeated calibration difficulties. Having reached a detection error of less than 0.5°, the participants proceeded to the main task. Pupil-size data were used for further analysis. Data from one subject in the first study were removed from further analysis due to a storage failure.

      Pupil data were divided into separate epochs, and data from the inter-trial interval (ITI) were selected for analysis. The ITI was defined as the time between the offset of the feedback screen of trial t and the stimulus presentation of trial t+1. Blinks and jitters were then detected and removed using linear interpolation, based on the pupil-size values before and after each blink. The data were also band-pass filtered using a Butterworth filter (second order, [0.01, 6] Hz)[50]. The pupil data were z-scored and then baseline-corrected by subtracting the average signal in the [-1000, 0] ms interval before ITI onset. For the statistical analysis (GLMM) in Figure 2, we used the average of the pupil signal over the ITI period. Because the ITI ends at the onset of the next stimulus, no pupil value in this average is contaminated by the upcoming stimulus. Importantly, trials with ITI>3 s were excluded from analysis (365 of 8800 trials in study 1 and 128 of 6000 in study 2; see also Table S7 and Selection criteria for data analysis in the Supplementary Materials).”
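      The per-trial ITI measure described in this Methods text can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' pipeline: blink samples are assumed to be already detected and marked as `None`, the Butterworth band-pass step is omitted (in Python it would typically be done with `scipy.signal.butter` and `scipy.signal.sosfiltfilt`), and z-scoring is applied per epoch, which is an assumption because the text does not state the normalization window. Times are in milliseconds relative to ITI onset, matching the 1000 Hz sampling rate; function names are hypothetical.

```python
import statistics


def interpolate_blinks(trace):
    """Linearly interpolate over missing samples (None), e.g. blinks.

    Each gap is filled from the valid samples immediately before and after it;
    gaps at the edges are filled with the nearest valid sample.
    """
    out = list(trace)
    valid = [i for i, v in enumerate(out) if v is not None]
    if not valid:
        raise ValueError("trace contains no valid samples")
    for i in range(valid[0]):
        out[i] = out[valid[0]]
    for i in range(valid[-1] + 1, len(out)):
        out[i] = out[valid[-1]]
    for a, b in zip(valid, valid[1:]):
        for i in range(a + 1, b):  # empty range when a and b are adjacent
            out[i] = out[a] + (i - a) / (b - a) * (out[b] - out[a])
    return out


def iti_pupil_measure(trace, times_ms, iti_end_ms, max_iti_ms=3000):
    """Mean z-scored, baseline-corrected pupil size over one ITI.

    times_ms are sample times relative to ITI onset (0 ms). Returns None for
    trials whose ITI exceeds max_iti_ms, which are excluded from analysis.
    """
    if iti_end_ms > max_iti_ms:
        return None
    clean = interpolate_blinks(trace)
    mu = statistics.fmean(clean)
    sd = statistics.pstdev(clean)
    z = [(v - mu) / sd for v in clean]
    # Baseline: mean of the 1000 ms preceding ITI onset.
    baseline = statistics.fmean(v for v, t in zip(z, times_ms) if -1000 <= t < 0)
    # Average only within the ITI, i.e. before the next stimulus appears.
    iti = [v - baseline for v, t in zip(z, times_ms) if 0 <= t < iti_end_ms]
    return statistics.fmean(iti)
```

      Restricting the average to samples between ITI onset and `iti_end_ms` is what keeps the measure free of responses to the next stimulus, which is the point the reviewer raised.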

      Fourth, while the EEG analysis in general provides interesting data, the link to the well-established CPP signal is not entirely convincing. CPP signals are usually identified and analyzed in a response-locked fashion, to distinguish them from other types of stimulus-locked potentials. One crucial feature here is that the CPPs in the different conditions reach a similar level just prior to the response. This is either not the case here, or the data are not shown in a format that allows the reader to identify these crucial features of the CPP. It is therefore questionable whether the reported signals indeed fully correspond to this decision-linked signal.

      Fifth, the authors present some effective connectivity analysis to identify the neural mechanisms underlying the possible top-down drive due to communicated confidence. It is completely unclear how they select the "prefrontal cortex" signals here that are used for the transfer entropy estimations, and it is in fact even unclear whether the signals they employ originate in this brain structure. In the absence of clear methodical details about how these signals were identified and why the authors think they originate in the prefrontal cortex, these conclusions cannot be maintained based on the data that are presented.

      Sixth, the description of the model fitting procedures and the parameter settings is missing, leaving it unclear to the reader how the models were "calibrated" to the data. Moreover, for many parameters of the biophysical model, the authors seem to employ fixed parameter values that may have been picked on the basis of unspecified criteria. This leaves the impression that the authors may even have manually changed parameter values until they found a set of values that produced the desired effects. The model would be even more convincing if the authors could, for every parameter, give the procedures that were used for fitting it to the data, or the exact criteria that were used to fix the parameter to a specific value.

      Seventh, on a related note, the reader wonders about some of the decisions the authors took in the specification of their model. For example, why was it assumed that the parameters of interest in the three competing models could only be modulated by the partner's confidence in a linear fashion? A non-linear modulation appears highly plausible, so extreme values of confidence may have much more pronounced effects. Moreover, why were the confidence computations assumed to be finished at the end of the stimulus presentation, given that for trials with RTs longer than the stimulus presentation, the sensory information almost certainly reverberated in the brain network and continued to be accumulated (in line with the known timing lags in cortical areas relative to objective stimulus onset)? It would help if these model specification choices were better justified and possibly even backed up with robustness checks.

      Eighth, the fake interaction partners showed several properties that were highly unnatural (they did not react to the participant's confidence communications, and their response times were random and thus unrelated to confidence and accuracy). This raises the question of how much the findings from this specific experimental setting would transfer to other real-life settings, and whether participants showed any behavioral reactions to the random response time variations as well (since several studies have shown that for binary choices like here, response times also systematically communicate uncertainty to others). Moreover, it is also unclear how the confidence convergence simulated in Figure 3d can conceptually apply to the data, given that the fake subjects did not react to the subject's communicated confidence as in the simulation.

    1. Author Response

      Reviewer #1 (Public Review):

      This work by Shen et al. demonstrates a single molecule imaging method that can track the motions of individual protein molecules in dilute and condensed phases of protein solutions in vitro. The authors applied the method to determine the precise locations of individual molecules in 2D condensates, which show heterogeneity inside condensates. Using the time-series data, they could obtain the displacement distributions in both phases, and by assuming a two-state model of trapped and mobile states for the condensed phase, they could extract diffusion behaviors of both states. This approach was then applied to 3D condensate systems, and it was shown that the estimates from the model (i.e., mobile fraction and diffusion coefficients) are useful to quantitatively compare the motions inside condensates. The data can also be used to reconstruct the FRAP curves, which experimentally quantify the mobility of the protein solution.

      This work introduces an experimental method to track single molecules in a protein solution and analyzes the data based on a simple model. The simplicity of the model facilitates a clear understanding of the situation in a test tube, and I think that the model is quite useful in analyzing condensate behaviors and will benefit the field greatly. However, the manuscript in its current form fails to situate the work in the right context; many previous works are omitted, exaggerating the novelty of the work. Also, the two-state model is simple and useful, but I am concerned about its limits. The authors extract the parameters from the experimental data by assuming the model. It is also likely that the molecules span a continuum between fully trapped and fully mobile states, and that such a continuum model could also explain the experimental data well.

      We thank the reviewer for the warm overview of our work and the insightful comments on the areas that need to be improved. We are very encouraged by the reviewer’s generally positive assessment of our approach. We have addressed these comments in the revised manuscript.

      Reviewer #2 (Public Review):

      In this paper, Shen and co-workers report the results of experiments using single particle tracking and FRAP combined with modeling and simulation to study the diffusion of molecules in the dense and dilute phases of various kinds of condensates, including those with strong specific interactions as well as weak specific interactions (IDR-driven). Their central finding is that molecules in the dense phase of condensates with strong specific interactions tend to switch between a confined state with low diffusivity and a mobile state with a diffusivity that is comparable to that of molecules in the dilute phase. In doing so, the study provides experimental evidence for the effect of molecular percolation in biomolecular condensates.

      Overall, the experiments are remarkably sophisticated and carefully performed, and the work will certainly be a valuable contribution to the literature. The authors' inquiry into single particle diffusivity is useful for understanding the dynamics and exchange of molecules and how they change when the specific interaction is weak or strong. However, there are several concerns regarding the analysis and interpretation of the results that need to be addressed, and some control experiments that are needed for appropriate interpretation of the results, as detailed further below.

      We thank the reviewer for the warm support of our work (assessing that our work is “remarkably sophisticated and carefully performed” and “will certainly be a valuable contribution”) and for the constructive comments/critiques, which we have now addressed in the revised manuscript (please refer to our detailed responses below).

      (1) The central finding that the molecules tend to experience transiently confined states in the condensed phase is remarkable and important. This finding is reminiscent of transient "caging"/"trapping" dynamics observed in diverse other crowded and confined systems. Given this, it is very surprising to see the authors interpret the single-molecule motion as being 'normal' diffusion (within the context of a two-state diffusion model), instead of analyzing their data within the context of continuous time random walks or anomalous diffusion, which is generally known to arise from transient trapping in crowded/confined systems. It is not clear that interpreting the results within the context of simple diffusion is appropriate, given their general finding of the two confined and mobile states. Such a process of transient trapping/confinement is known to lead to transient subdiffusion at short times and then diffusive behavior at sufficiently long times. There is a hint of this in the inset of Fig 3, but these data need to be shown on log-log axes to be clearly interpreted. I encourage the authors to think more carefully and critically about the nature of the diffusive model to be used to interpret their results.

      We thank the reviewer for the insightful comments and suggestions, which have been very helpful for us to think more deeply about the experimental data and the possible underlying mechanism of our findings. Indeed, the phase-separated systems studied here resemble previously studied crowded and confined systems with transient caging/trapping dynamics (see, e.g., Akimoto et al., 2011; Bhattacharjee and Datta, 2019; Wong et al., 2004; these references have been added in the revised manuscript). In our PSD system in Figure 3, the caging/trapping of NR2B in the condensed phase is likely due to its binding to the percolated PSD network. Thus, NR2B molecules in the condensed phase should undergo subdiffusive motion. Indeed, our single-molecule tracking data show that the motion of NR2B fits well with the continuous time random walk (CTRW) model, as surmised by this reviewer. We have now fitted the MSD curve of all tracks of NR2B in the condensed phase with an anomalous diffusion model, MSD(t) = 4Dt^α (see Response Figure 1 below). The fitted α is 0.74±0.03, indicating that NR2B molecules in the condensed phase indeed undergo subdiffusive motion. The fitted diffusion coefficient D is 0.014±0.001 μm²/s. In the revised manuscript, we have replaced the Brownian motion fit in Figure 3E of the original manuscript with this subdiffusive model fit, to highlight the complexity of the NR2B diffusion we observed in the PSD condensed phase.

      Response Figure 1: Fit of the MSD curve in the condensed phase (mean values as red dots, with standard errors as error bars) with an anomalous diffusion model (blue curve, MSD = 4Dt^α). The fit gives D = 0.014±0.001 μm²/s and α = 0.74±0.03.
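      One simple way to estimate the exponent in MSD(t) = 4Dt^α, and to see subdiffusion directly on the log-log axes the reviewer asked for, is a straight-line fit of log MSD against log t, since log MSD = log(4D) + α log t. The sketch below is an unweighted least-squares version of that idea; the fit reported above may have used a different procedure (for example, a nonlinear fit weighted by the standard errors), so it is only an illustration. The function name is hypothetical, and for α ≠ 1 the fitted D carries units of μm²/s^α.

```python
import math


def fit_anomalous_msd(lags_s, msd_um2):
    """Fit MSD(t) = 4 * D * t**alpha by least squares in log-log space.

    Returns (D, alpha), using log MSD = log(4 D) + alpha * log t.
    The fit is unweighted, so every lag counts equally.
    """
    x = [math.log(t) for t in lags_s]
    y = [math.log(m) for m in msd_um2]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    alpha = sxy / sxx               # slope on log-log axes
    intercept = my - alpha * mx     # equals log(4 D)
    return math.exp(intercept) / 4, alpha


# Noise-free synthetic curve with the parameter values quoted above.
lags = [0.03 * k for k in range(1, 21)]
msd = [4 * 0.014 * t ** 0.74 for t in lags]
print(fit_anomalous_msd(lags, msd))  # recovers D ≈ 0.014, alpha ≈ 0.74
```

      A slope below 1 on these axes is the signature of subdiffusion; for a transient-trapping process one would expect the local slope to drift back toward 1 at long lags, which a single global fit like this one does not capture.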

      We find it useful to interpret the apparent diffusion coefficient (D=0.014±0.001 μm²/s) derived from this particular anomalous diffusion model as containing information on NR2B motions in a broadly construed mobile state (i.e., corresponding to the network-unbound form) as well as in a broadly construed confined state (i.e., corresponding to NR2B molecules bound to percolated PSD networks). The global fitting using the subdiffusive model does not pin down the motion properties of NR2B in these different motion states. This is why we used, at least as a first approximation, the two-state motion switch model (HMM model) to analyse our data (please refer also to our detailed response to comment #7 from reviewer 1 and the corresponding additional analyses made during the revision, as highlighted in Response Figure 4).

      As described in our responses to comments #4 and #7 from reviewer 1, the two-state model is most likely a simplification of NR2B motion in the condensed phase. Both the mobile state and the confined state in our simplified interpretative framework likely represent ensemble averages of their respective motion states. The tracking data currently available do not allow us to further distinguish the substates, but further analysis using more refined models in the future may provide more physical insight, as we now emphasize in the revised “Discussion” section: “With this in mind, the two motion states in our simple two-state model for condensed-phase dynamics should be understood to consist of multiple sub-states. For instance, one might envision that the percolated molecular network in the condensed phase is not uniform (e.g., existence of locally denser or looser local networks) and dynamic (i.e., local network breaking and forming). Therefore, individual proteins binding to different sub-regions of the network will have different motion properties/states. … In light of this basic understanding, the “confined state” and “mobile state” as well as the derived diffusion coefficients in this work should be understood as reflections of ensemble-averaged properties arising from such an underlying continuum of mobilities. Further development of experimental techniques in conjunction with more refined models of anomalous diffusion (Joo et al., 2020; Kuhn et al., 2021; Muñoz-Gil et al., 2021) will be necessary to characterize these more subtle dynamic properties and to ascertain their physical origins” (p.23 of the revised manuscript).

      A practical reason for using the two-state motion-switch HMM to analyse our tracking data in the condensed phase is that the lifetime of the putative mobile state (when the per-frame molecular displacements are relatively large) is very short, and such relatively fast, short trajectories are interspersed with long confined states (see Response Figure 4C for an example). Statistically, ascertaining a particular anomalous diffusion model by fitting to such short tracks is unlikely to be reliable. Therefore, we opted here for a semi-quantitative interpretative framework, using fitted diffusion coefficients in a two-state HMM as well as the new correlation-based approach for demarcating a low-mobility state and a high-mobility state (see our detailed response to reviewer 1’s point #7), in the present manuscript (which is already quite an extensive study), while leaving refinements of our computational modelling to future effort.
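      To make the two-state picture concrete, the sketch below fits a two-component mixture of 2D Brownian step-length distributions by expectation-maximization. It is a deliberately simpler stand-in for the HMM used in the manuscript: it treats single-frame steps as independent and ignores transitions between states along a track. The frame interval (0.03 s) and the diffusion coefficients used for the synthetic data (0.17 and 0.013 μm²/s) are taken from this response; the 30% mobile fraction, the initial guesses, and all function names are hypothetical.

```python
import math
import random


def step_pdf(r, d, dt):
    """Density of the 2D step length r for Brownian motion with coefficient d."""
    var = 2 * d * dt  # per-coordinate variance of a single-frame displacement
    return r / var * math.exp(-r * r / (2 * var))


def fit_two_state_steps(steps, dt, d_init=(0.1, 0.01), f_init=0.5, n_iter=200):
    """EM fit of a two-component mixture of 2D step-length distributions.

    Returns (f_mobile, d_mobile, d_confined). Steps are treated as independent,
    so unlike an HMM this ignores the order of states along a track.
    """
    d_m, d_c = d_init
    f = f_init
    n = len(steps)
    for _ in range(n_iter):
        # E-step: probability that each step came from the mobile component.
        w = []
        for r in steps:
            pm = f * step_pdf(r, d_m, dt)
            pc = (1 - f) * step_pdf(r, d_c, dt)
            total = pm + pc
            w.append(pm / total if total > 0 else 0.5)
        # M-step: weighted maximum likelihood, using <r^2> = 4 D dt in 2D.
        w_sum = sum(w)
        f = w_sum / n
        d_m = sum(wi * r * r for wi, r in zip(w, steps)) / (4 * dt * w_sum)
        d_c = sum((1 - wi) * r * r for wi, r in zip(w, steps)) / (4 * dt * (n - w_sum))
    return f, d_m, d_c


def simulate_steps(n, f_mobile, d_m, d_c, dt, seed=0):
    """Hypothetical test data: independent steps drawn from either state."""
    rng = random.Random(seed)
    steps = []
    for _ in range(n):
        d = d_m if rng.random() < f_mobile else d_c
        sd = math.sqrt(2 * d * dt)
        steps.append(math.hypot(rng.gauss(0, sd), rng.gauss(0, sd)))
    return steps


steps = simulate_steps(4000, f_mobile=0.3, d_m=0.17, d_c=0.013, dt=0.03, seed=1)
print(fit_two_state_steps(steps, dt=0.03))
```

      Because this mixture ignores temporal ordering, it cannot use the information that short mobile excursions are interspersed with long confined episodes; that is exactly what an HMM adds, and why it is better suited to the short-lived mobile state described above.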

      Even in the context of the 'normal' two-state diffusion model they present, if they wish to stick with it (although it seems inappropriate to do so), can the authors provide some physical intuition for what exactly sets the diffusivities they extract from their data (0.17 and 0.013 μm²/s for the mobile and confined states, respectively)? Can these be understood using, e.g., the Stokes-Einstein or Ogston models?

      As stated above, we are in general agreement with this reviewer that the motion of NR2B in the condensed phase is more complex than the simple two-state picture we adopted as a semi-quantitative interpretation that is adequate for our present purposes. Within the multi-pronged analysis we have performed thus far, NR2B molecules clearly undergo anomalous diffusion in solutions containing dense, percolated, NR2B-binding molecular networks. As a first approximation, our simple two-state HMM analysis yielded two simple diffusion coefficients (0.17 μm²/s for the mobile state and 0.013 μm²/s for the confined state). We regard the mobile-state diffusion coefficient as providing a time scale for relatively fast diffusive motions (which may be further classified into various motion substates in the future) that are not bound to, or only weakly associated with, the percolated network of strong interactions in the PSD condensed phase. Molecules in the confined, or low-mobility, state in our present formulation are likely bound relatively tightly to the percolated networks; according to the Stokes-Einstein model, their diffusion coefficient should therefore be much smaller than that of the unbound form (i.e., the mobile state). However, because of the detection limit of the super-resolution imaging method (resolution of ~20 nm), we could not definitively determine the actual diffusivity beyond the resolution limit. The diffusion coefficient in the confined state can therefore also be interpreted as reflecting a Gaussian-distributed microscope detection error, f(x) = exp(−x²/(2σ²))/(σ√(2π)), i.e., x ~ N(0, σ²), where σ is the standard deviation of the Gaussian distribution, viewed as the resolution of localization-based microscopy, and x is the detection error between the recorded localization and the molecule’s actual position. The displacement in the confined state is the difference between localizations in consecutive frames, i.e., the difference of two independent Gaussian variables, so each coordinate of this displacement (r) is distributed as r ~ N(0, 2σ²). To link the detection error with the fitted diffusion coefficient, we calculated the log likelihood of this Gaussian-distributed localization error, ln L = −½ ln(4πσ²) − r²/(4σ²), for the maximum likelihood estimation used to fit the HMM model. A random walk has a log likelihood term of the same form, ln L = −½ ln(4πDt) − r²/(4Dt), in maximum likelihood estimation.

      These two log likelihood functions produce the same fitting results when 4σ² is identified with 4Dt, i.e., σ² = Dt. In this way, the diffusion coefficient yielded by our HMM analyses for the confined state (0.0127 μm²/s) can be interpreted in terms of the standard deviation of the localization detection error (or microscope resolution limit), σ = √(Dt) = √(0.0127 μm²/s × 0.03 s) ≈ 19.5 nm. We have included this consideration as an alternative interpretation of the confined-state or low-mobility motions, with the results now provided in the “Materials and Methods” section in the sentence, viz., “… the L-component distribution may be reasonably fitted (albeit with some deviations, see below) to a simple-diffusion functional form with a parameter s = 13.6 ± 3.7 nm, where s may be interpreted as a microscope detection error due to imaging limits or alternately expressed as s = √(D_L t) with D_L = 0.006149 μm²/s being the fitted confined-state diffusion coefficient and t = 0.03 s the time interval between experimental frames. (The HMM-estimated confined-state D_c = 0.0127 μm²/s corresponds to s = 19.5 nm)” (p.32 of the revised manuscript).
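      The correspondence between the confined-state diffusion coefficient and localization error can be checked numerically. The sketch assumes, as in the text, independent Gaussian localization error with standard deviation σ per coordinate; for an immobile molecule the frame-to-frame 2D jump then has ⟨r²⟩ = 4σ², which a diffusion fit reads as 4Dt, so the apparent D equals σ²/t. Function names are illustrative, and the numerical inputs are the values quoted above.

```python
import math
import random


def localization_sigma_nm(d_um2_per_s, dt_s):
    """Localization s.d. (nm) that mimics an apparent D, using sigma**2 = D * dt."""
    return math.sqrt(d_um2_per_s * dt_s) * 1000.0


def apparent_d_static(sigma_um, dt_s, n=100_000, seed=0):
    """Apparent D (um^2/s) of an immobile molecule seen through localization noise.

    Every frame's localization has independent Gaussian error (s.d. sigma_um per
    coordinate); D is estimated from 2D frame-to-frame jumps as <r^2> / (4 dt).
    """
    rng = random.Random(seed)
    prev = (rng.gauss(0, sigma_um), rng.gauss(0, sigma_um))
    total = 0.0
    for _ in range(n):
        cur = (rng.gauss(0, sigma_um), rng.gauss(0, sigma_um))
        total += (cur[0] - prev[0]) ** 2 + (cur[1] - prev[1]) ** 2
        prev = cur
    return total / n / (4 * dt_s)


print(round(localization_sigma_nm(0.0127, 0.03), 1))    # -> 19.5 (HMM confined-state D)
print(round(localization_sigma_nm(0.006149, 0.03), 1))  # -> 13.6 (D_L from the Methods quote)
print(apparent_d_static(0.0195, 0.03))  # close to 0.0195**2 / 0.03 = 0.012675
```

      A molecule that does not move at all would therefore still be assigned a confined-state D of roughly 0.013 μm²/s at a localization precision near 20 nm, which is why that value cannot be distinguished from the resolution limit.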

      (2) Equation 1 (and hence equation 2) is concerning. Consider the limit $P_m = 1$, that is, there are no confined particles in the condensed phase; the model then becomes a diffusion equation with spatially dependent diffusivity, $\partial c/\partial t = \nabla \cdot (D(x)\nabla c)$. The molecules' diffusivity $D(x)$ is $D_d$ in the dilute phase and $D_m$ in the condensed phase. No matter what values $D_d$ and $D_m$ take, at equilibrium the concentration should always be uniform everywhere. According to Equation 1, the concentration ratio will be $D_d/D_m$, so if $D_d/D_m \neq 1$, a concentration gradient is generated spontaneously, which violates the second law of thermodynamics. Can the authors please justify the use of this equation?

      Indeed, the derivation of Equation 1 appears to be concerning. The flux J is proportional to D * dc/dx (not kDc as in the manuscript). At equilibrium dc/dx = 0 on both sides and c is constant everywhere. Can the authors please comment?

      So then another question is, why does the Monte Carlo simulation result agree with Equation 1? I suspect this has to do with the behavior of particles crossing the boundary. Consider another limit where $D_m = 0$, that is, particles freeze in the condensed phase. If a particle cannot escape once it enters the condensed phase, then eventually all particles will end up in the condensed phase and EF = ∞. The authors likely used this scheme. But, as mentioned above, this appears to violate the second law.
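      The reviewer's point about boundary behavior can be illustrated with a minimal lattice sketch (our illustration, not the Monte Carlo code used in the study). It assumes a 1D ring of sites split into a dilute half and a condensed half with no confined state ($P_m = 1$), and relaxes the master equation under two hopping rules. In the site-based rule, a particle leaves site i at a rate set by its own $D_i$, whose continuum limit is $\partial c/\partial t = \nabla^2(Dc)$; in the bond-based rule, the hop rate across a bond is the same in both directions (here, as an arbitrary choice, the harmonic mean of the two site values), whose continuum limit is $\partial c/\partial t = \nabla \cdot (D\nabla c)$. The site-based rule yields a stationary ratio of $D_d/D_m$, reproducing Eq. 1, whereas the symmetric rule yields a uniform density. The diffusivities are hypothetical.

```python
def stationary_density(d_site, scheme, dt=0.2, n_steps=20000):
    """Relax the master equation on a ring of sites with forward Euler and return p_i.

    scheme="site": a particle leaves site i towards each neighbour at rate d_site[i].
    scheme="bond": the rate across bond (i, i+1) is equal in both directions
                   (harmonic mean of the two site values).
    """
    n = len(d_site)
    p = [1.0 / n] * n  # start from a uniform distribution
    for _ in range(n_steps):
        new_p = p[:]
        for i in range(n):
            j = (i + 1) % n  # right-hand neighbour on the ring
            if scheme == "site":
                k_ij, k_ji = d_site[i], d_site[j]  # rate set by the departure site
            elif scheme == "bond":
                k_ij = k_ji = 2.0 * d_site[i] * d_site[j] / (d_site[i] + d_site[j])
            else:
                raise ValueError(f"unknown scheme: {scheme}")
            flux = k_ij * p[i] - k_ji * p[j]  # net probability flow from i to j
            new_p[i] -= dt * flux
            new_p[j] += dt * flux
        p = new_p
    return p


def enrichment_fold(p, condensed):
    """Mean density on condensed sites divided by mean density on dilute sites."""
    inside = [x for x, c in zip(p, condensed) if c]
    outside = [x for x, c in zip(p, condensed) if not c]
    return (sum(inside) / len(inside)) / (sum(outside) / len(outside))


if __name__ == "__main__":
    n_sites, d_dilute, d_mobile = 20, 1.0, 0.2  # hypothetical values, D_d / D_m = 5
    condensed = [i >= n_sites // 2 for i in range(n_sites)]
    d_site = [d_mobile if c else d_dilute for c in condensed]
    for scheme in ("site", "bond"):
        ef = enrichment_fold(stationary_density(d_site, scheme), condensed)
        print(f"{scheme}-based hopping: EF = {ef:.3f}")
```

The difference lies entirely in how the hop rate is assigned at the interface: only the symmetric (bond-based) rule satisfies detailed balance with a uniform equilibrium density.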

      Thanks for the incisive comment. After in-depth consideration, we agree with the reviewer that Eq. 1 should not be presented as a relation that is generally applicable to diffusive motions of molecules in all phase-separated systems. As the reviewer correctly points out, there are cases in which this relation can lead to unphysical outcomes.

      Nonetheless, based on our theoretical/computational modeling, it is also clear, empirically, that Eq.1 holds approximately for the NR2B/PSD system we studied, and as such it is a useful approximate relation in our analysis. We have therefore provided a plausible physical perspective for Eq.1’s applicability as an approximate relation based upon a schematic consideration of diffusion on an underlying rugged (free) energy landscape (Zhang and Chan, 2012) of a phase-separated system (See Figure 3G in the revised manuscript), while leaving further studies of such energy landscape models to future investigations.

      This additional perspective is now included in the following added passage under a new subheading in the revised manuscript:

      "Physical picture and a two-state, two-phase diffusion model for equilibrium and dynamic properties of PSD condensates"

      (3) Despite the above two major concerns described in (1) and (2), the enrichment due to the presence of a "confined state", is reasonable. The equilibrium between "confined" and "mobile" states is determined by its interaction with the other proteins and their ratio at equilibrium corresponds to the equilibrium constant. Therefore EF=1/Pm is reasonable and comes solely from thermodynamics. In fact, the equilibrium partition between the dilute and dense phases should solely be a thermodynamic property, and therefore one may expect that it should not have anything to do with diffusivity. Can the authors please comment on this alternative interpretation?

      Thanks for this thought-provoking comment. We agree with the reviewer that the relative molecular densities in the condensed and dilute phases are governed by thermodynamics unless there is energy input into the system. However, in our formulation, the mobile ratio is not the only parameter determining the enrichment fold in a phase-separated system. The approximate relation (Eq. 1) is EF ≈ D_d/(P_m D_m), so EF ≈ 1/P_m only when D_d ≈ D_m. However, mobile-state diffusion in the condensed phase is found to be appreciably slower than diffusion in the dilute phase (D_d > D_m). In general, a hallmark of a phase-separated system is that it enriches the molecules involved in the condensed phase, regardless of whether a molecule is a driver (or scaffold) or a client of the system. Such enrichment is expected to result from the net free-energy gain due to increased molecular interactions in the condensed phase (as envisioned in Response Figure 9). For example, in the phase separation systems containing PrLD-SAMME (Figure 4 of the manuscript), P_m is close to 1, but the enrichment of PrLD-SAMME in the condensed phase is much greater than 1 (estimated to be ~77 based on the fluorescence intensity of the protein in the dilute and condensed phases; Figure 5—figure supplement 1). As far as Eq. 1 is concerned, this is mathematically consistent because the diffusion coefficient of PrLD-SAMME in the condensed phase (D ~0.2 μm²/s) is much smaller than that of a monomeric molecule of similar molecular mass in dilute solution (D ~100 μm²/s, measured by a FRAP-based assay; molecules in the dilute solution move too fast in 3D to be tracked). Physically, the slower molecular motion in the condensed phase is most likely caused by favorable intermolecular interactions, and the same favorable interactions that underpin these dynamic effects also lead to a larger equilibrium Boltzmann population.
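      The two readings discussed in this exchange can be placed side by side with a few lines of arithmetic. The sketch assumes that a condensed-phase molecule switches between a confined and a mobile state with a hypothetical binding equilibrium constant K = [confined]/[mobile], so that P_m = 1/(1 + K). The function `ef_eq1` encodes the approximate Eq. 1, EF ≈ D_d/(P_m D_m); `ef_partition_only` encodes the reviewer's purely thermodynamic reading, EF = 1/P_m. All names and numbers are illustrative, not fitted values.

```python
def mobile_fraction(k_bind):
    """P_m for a confined <-> mobile equilibrium with K = [confined]/[mobile]."""
    return 1.0 / (1.0 + k_bind)


def ef_eq1(d_dilute, d_mobile, p_m):
    """Approximate relation (Eq. 1): EF ~ D_d / (P_m * D_m)."""
    return d_dilute / (p_m * d_mobile)


def ef_partition_only(p_m):
    """Reviewer's reading: mobile molecules partition evenly between phases, so EF = 1 / P_m."""
    return 1.0 / p_m


if __name__ == "__main__":
    p_m = mobile_fraction(3.0)  # hypothetical K = 3, so P_m = 0.25
    print("EF = 1/P_m:", ef_partition_only(p_m))
    print("Eq. 1 with D_d = D_m:", ef_eq1(1.0, 1.0, p_m))
    print("Eq. 1 with D_d = 5 D_m:", ef_eq1(1.0, 0.2, p_m))
```

With K = 3, both readings give EF = 4 when D_d = D_m, while Eq. 1 gives EF = 20 when D_d = 5 D_m. This is the distinction drawn in the reply above.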

    1. Author Response

      Reviewer #1 (Public Review):

      Nandan et al. attempt to demonstrate how a phenomenology in the molecular signaling network inside a cell could translate to changes in the behavior of the cell and its ability to respond/adapt to changes in the environment over time and space. While this investigation is performed in the context of mammalian cells, the result holds significance for eukaryotic cells at large and demonstrates a mechanism by which cells may use transient memory states to respond robustly to complex environmental cues. To study such mechanisms, it is important to show how the cell may encode such transient memory, how this memory is generated from environmental cues, how it translates to cellular motion, and how it enables cells to have persistent directional motion in the case of transient disruptions in the signal while responding to significant and long-lasting disruptions. The authors attempt to answer all of these questions.

      Strengths:

      The manuscript attempts to combine mathematical theory, mechano-chemical models, numerical simulations, and experimental evidence. Thus, the investigation spans diverse methods and spatio-temporal scales (from receptors to continuum mechanical models to whole-cell motion) to answer a unified question. The mathematical theory of dynamic states and bifurcation theory provides the basis for the generation of "ghost" states that can encode transient memory; the mechano-chemical models show how such dynamical states can be realized in the EGFR signaling network; the numerical simulations show both how cells can respond to environmental cues by generating polarised states, and by navigating complex environmental cues, and experiments provide evidence that this may be the case for epithelial cells in the presence of growth factors. The manuscript is well-structured with the main conclusions clearly identified and separated from each other in the different sections. The theoretical investigation is thorough and the main text provides an intuition as to what the authors are trying to convey, while the Methods reveal the calculations performed and the approximations made. The modeling and numerical simulations are detailed and provide a baseline expectation for the system in different parameter regimes. The experiments and the analysis extensively characterize the system. I commend the authors for having delved into so many methods to answer this problem, and the authors demonstrate significant knowledge of the different methods with many novel contributions.

      Weaknesses:

      The key weakness of the results is in establishing clear distinctions between what would be expected (naively and based on results from other groups) from alternate explanations, and what is realized in the experimental results that support the hypothesis put forward by the authors. For example, the authors quote a relatively long time scale of persistence of polarisation, but it is unclear if this is longer than is expected from slow dephosphorylation to provide evidence for the existence of the "ghost" state from the saddle-node bifurcation. Further, key experimental results regarding the persistence of motion following gradient washout seem to differ from the authors' own predictions from simulations.

      There are several other models that attempt to describe eukaryotic chemotactic motion that persists despite brief disruptions and is able to adapt to changes in the environment over longer timescales. In my opinion, the main strength of the paper does not lie in providing another such model, but in providing a mechanistic understanding that bridges several scales. However, this places the burden on the authors to justify the link between the different scales.

      This is an ambitious manuscript and the authors are clearly very bold for attempting such a comprehensive treatment of such a complex system. The authors provide an excellent framework to understand mammalian cellular chemotaxis on multiple scales and attempt to justify the framework using several experiments and extensive analysis. However, they require further analysis and characterization to demonstrate that their experimental results provide the necessary justification for their conclusions as opposed to alternate possibilities.

      We thank the referee for the in-depth suggestions and valuable comments on how to improve the manuscript, which we have implemented in detail in the amended version. We have especially focused on providing the necessary justification that working memory emerges from a “ghost” signaling state rather than from a slow dephosphorylation mechanism. For this, we fitted the single-cell EGFRp temporal profiles after gradient wash-out, with and without Lapatinib inhibition, with an inverse sigmoid function and quantified the respective half-life and Hill coefficient. The analysis, included in the new Figure 2 – figure supplement 2, shows that under Lapatinib treatment, which inhibits the kinase activity of the receptor so that the dynamics of the system are governed by the dephosphorylating activity of the phosphatases, the system relaxes to the basal state in an almost exponential process (half-life ~10 min, Hill coefficient ~1.3). In contrast, under normal conditions EGFR phosphorylation relaxes to the basal state in ~30 min, corroborating that the system remains trapped in the “ghost” state. Moreover, the transition from the memory state to the basal state is rapid, as reflected in an estimated Hill coefficient of ~3. Additionally, we discuss how the slow time scale that emerges from the “ghost” state serves as a possible mechanistic link between the rapid phosphorylation/dephosphorylation events and the ~40 min of memory in cell shape polarization/directional cell migration after growth factor removal.
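      The half-life and Hill coefficient of a relaxation profile can be read from an inverse sigmoid of the form $y(t) = y_b + (y_0 - y_b)/(1 + (t/t_{1/2})^n)$. The sketch below uses a Hill-plot linearization (ordinary least squares on $\ln(1/f - 1)$ versus $\ln t$, where $f$ is the normalized remaining signal). This is a simpler stand-in for a nonlinear least-squares fit and is not necessarily the procedure used in the study. It assumes that the initial level $y_0$ and the basal level $y_b$ are known. The profiles are synthetic and use the half-lives and Hill coefficients quoted above.

```python
import math


def inverse_sigmoid(t, y0, yb, t_half, n):
    """Inverse Hill-type decay from y0 to the basal level yb."""
    return yb + (y0 - yb) / (1.0 + (t / t_half) ** n)


def hill_plot_fit(times, values, y0, yb):
    """Estimate (t_half, n) by linear regression on the Hill plot.

    Points whose normalized value is very close to 0 or 1 are dropped because
    the logit transform strongly amplifies noise there.
    """
    xs, zs = [], []
    for t, y in zip(times, values):
        f = (y - yb) / (y0 - yb)  # fraction of the initial signal remaining
        if t > 0 and 0.02 < f < 0.98:
            xs.append(math.log(t))
            zs.append(math.log(1.0 / f - 1.0))  # equals n*ln(t) - n*ln(t_half)
    mx, mz = sum(xs) / len(xs), sum(zs) / len(zs)
    slope = sum((x - mx) * (z - mz) for x, z in zip(xs, zs)) / sum((x - mx) ** 2 for x in xs)
    intercept = mz - slope * mx
    return math.exp(-intercept / slope), slope


if __name__ == "__main__":
    times = [float(t) for t in range(0, 91, 2)]  # minutes after wash-out (synthetic grid)
    slow = [inverse_sigmoid(t, 1.0, 0.1, 30.0, 3.0) for t in times]
    fast = [inverse_sigmoid(t, 1.0, 0.1, 10.0, 1.3) for t in times]
    for label, profile in (("switch-like", slow), ("near-exponential", fast)):
        t_half, n = hill_plot_fit(times, profile, 1.0, 0.1)
        print(f"{label}: t_half = {t_half:.1f} min, Hill coefficient = {n:.2f}")
```

A Hill coefficient near 1 indicates a gradual, nearly exponential relaxation. A coefficient near 3 indicates a plateau followed by a sharp drop, which is the signature used above to distinguish trapping in the ghost state from simple dephosphorylation. On noisy single-cell data, a nonlinear fit that also estimates $y_0$ and $y_b$ would be more robust.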

      Moreover, we include additional quantification of memory in single-cell directional motility with and without EGFR inhibitor (Figure 3 – figure supplement 3), and relate these results to previously proposed mechanisms of memory in directional migration arising from cytoskeletal asymmetries, while also highlighting the importance of memory in polarized receptor signaling as a necessary means of coupling cellular processes that occur on different time scales. We have further expanded the manuscript by providing theoretical predictions of how organization at criticality uniquely enables the resolution of simultaneous signals. We address the referee’s comments as outlined below:

      Reviewer #2 (Public Review):

      Nandan, Das et al. set out to study the mechanism by which single cells follow extracellular signals in variable environments and generate persistent directional migration in the presence of changing chemoattractant fields. Importantly, cells are able to (1) maintain the orientation acquired during the initial signal despite disruptions or noise while still (2) adapting their direction of migration in response to newly encountered signals. Previous models have accounted for either of these properties, but not both simultaneously. To reconcile these observations, this work proposes an underlying mechanism in which cells utilize a form of working memory.

      The authors present a dynamical systems framework in which the presence of dynamical 'ghosts' in an underlying signaling network allows the cell to retain a memory of previously encountered signals. These are generated as follows: a pitchfork bifurcation confers a symmetry-breaking transition from a non-polarised to a polarised signaling state/direction-oriented cell shape. After a subsequent saddle-node bifurcation, a 'ghost' of the stable attractor emerges. This 'ghost' state is metastable, however, which is what allows cells to integrate new signals as well as to adapt their direction of migration.
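      The slowing down near a 'ghost' can be seen in the generic saddle-node normal form $dx/dt = \mu + x^2$. This is the textbook normal form, not the EGFR model itself. Just past the bifurcation ($\mu > 0$ small) there is no fixed point, yet trajectories linger near $x = 0$, where the pair of fixed points used to be. The transit time from $-L$ to $L$ is $2\arctan(L/\sqrt{\mu})/\sqrt{\mu}$, which approaches $\pi/\sqrt{\mu}$. The sketch integrates the equation with forward Euler; the step size and the start and end points are arbitrary choices.

```python
import math


def ghost_passage_time(mu, x_start=-1.0, x_end=1.0, dt=1e-3):
    """Time for dx/dt = mu + x**2 (mu > 0) to travel from x_start to x_end (forward Euler)."""
    x, t = x_start, 0.0
    while x < x_end:
        x += dt * (mu + x * x)
        t += dt
    return t


def exact_passage_time(mu, half_width=1.0):
    """Analytic transit time from -half_width to +half_width."""
    return 2.0 * math.atan(half_width / math.sqrt(mu)) / math.sqrt(mu)


if __name__ == "__main__":
    for mu in (0.04, 0.01, 0.0025):
        print(f"mu = {mu}: simulated T = {ghost_passage_time(mu):.2f}, "
              f"exact T = {exact_passage_time(mu):.2f}, pi/sqrt(mu) = {math.pi / math.sqrt(mu):.2f}")
```

Each fourfold reduction of the distance to the bifurcation roughly doubles the time spent in the bottleneck. This inverse-square-root scaling is what makes a ghost state a source of long but finite memory.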

      The authors demonstrate these dynamics in the Epidermal Growth Factor Receptor (EGFR) signaling network. This pathway is central in many embryonic and adult processes conserved in most animal groups, making it an ideal choice to characterise a phenomenon observed in such a diverse range of cells. The authors couple a mechanical model of the cell with the biochemical signaling model for EGFR, which nicely allows them to thoroughly simulate cellular deformations that they predict will occur during polarization and motility.

      Key features of the model are well-supported by empirical data from experiments: (1) quantitative live-cell imaging of polarised EGFR signaling shows the existence of a distinct polarised 'ghost' state after removal of extracellular signals and (2) motility experiments confirm the manifestation of this memory in allowing for persistent cell migration upon loss of a signal. In an extension of the latter experiment, the authors also show that cells displaying this working memory are still able to respond to changes in the chemoattractant field as necessary.

      The experiments using Lapatinib to disrupt the EGFR dynamics are less convincing. The authors show that subjecting cells to this inhibitor results in the absence of memory and removes the ability of cells to maintain their orientation after the gradient was disrupted. Clarification of which aspect(s) of the EGFR network within the context of the model are precisely disrupted by Lapatinib would be helpful in strengthening the authors' claims here that it is the mechanism of working memory and not other features of the EGFR network, that is responsible for the results shown.

      We thank the referee for the detailed comments and suggestions that helped us to improve the manuscript. In the amended version of the manuscript, we describe that Lapatinib hinders EGFR kinase activity; in the model, this mainly affects the autocatalytic rate constant. We performed numerical simulations in which the autocatalytic rate constant is decreased after gradient removal, and show that the EGFRp temporal profile decays slowly after gradient removal, whereas the state-space trajectory transits directly from the polarized to the basal state without intermediate trapping in state space, thereby qualitatively resembling the experimental observations under Lapatinib treatment (compare Figure 2 – figure supplement 2C, D with Figure 2G in the amended version of the manuscript).

      Reviewer #3 (Public Review):

      Cell navigation in chemoattractant fields is important to many physiological processes, including in development and immunity. However, the mechanisms by which cells break symmetry to navigate up concentration gradients, while also adapting to new gradient directions, remain unclear. In this study, the authors propose a new theoretical model for this process: cells are poised near a subcritical pitchfork bifurcation, which allows them to simultaneously maintain the memory of a polarized state over intermediate timescales and respond to new cues. They show analytically that a model of EGFR phosphorylation dynamics has a subcritical pitchfork bifurcation, and use simulations of in silico cells to demonstrate both memory and adaptability in this system. They further measure EGFR phosphorylation profiles, as well as migration tracks under external gradients, in real cells.

      This work contributes an interesting new theoretical framework, bolstered by substantial analysis and simulations, as well as valuable measurements of cell behavior and polarization. Both the modeling and the measurements are careful and thorough, and each represents a substantial contribution to decoding the complex problem of cell navigation. The measurements support and quantify the phenomenon of directional memory. The main weakness is that it is not clear that they also support the mechanism proposed by the model.

      Theoretical framework

      One of the main strengths of this work is the thorough theoretical analysis of a model of symmetry breaking in EGFR phosphorylation. The authors perform linear stability analysis and a weakly nonlinear amplitude equation analysis to characterize the transition. Additionally, they convincingly demonstrate in simulations that this model can generate robust polarization, with memory over intermediate timescales and responsiveness to new gradient directions. However, the relationship between the full dynamical system and the bifurcation diagrams shown in Figure 1A and Figure 1-Figure Supplement 1B is not clear. In particular, there is an implicit reduction from an infinite dimensional system (continuous in space) to an ODE system.

      From Methods 5.15, it appears that this was accomplished by approximating the continuous cell perimeter as a diffusively-coupled two-component system, representing the left and right halves of the cell (Methods 5.15 Equation 18 to Equation 19). However, this is not stated explicitly in the methods, and not at all in the main text, making the argument difficult to follow. Additionally, the main text and methods describe the emergence of an unstable odd spatial eigenmode as the key requirement for the pitchfork bifurcation. It is not clear why it is sufficient to show this emergence in the two-component system.

      We thank the referee for the detailed and insightful comments, which we have addressed in detail in the amended version of the manuscript. Indeed, as the referee noted, we assumed a simplified one-dimensional geometry composed of two compartments (front and back), resembling a projection of the membrane along the main diagonal of the cell. The standard approach to modeling diffusion along the membrane in this case is a simple exchange of the diffusing components between the compartments. As demonstrated in the analysis, the one-dimensional projection preserves all of the main features of the PDE model. The numerical bifurcation analysis was performed only for comparative purposes. In the amended version of the manuscript, we therefore extend the description of this simplification and of the purpose of its implementation. Additionally, one of our reasons for developing the theoretical network was to provide a general method for identifying a subcritical PB in PDE models.
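      Why a front/back two-compartment projection can capture the onset of symmetry breaking can be seen from its linearization. The sketch assumes two species per compartment, a local Jacobian $J$, and species-specific exchange rates $K$. Both are hypothetical numbers below, not EGFR parameters. The 4×4 Jacobian of the coupled system then splits into an even mode (front = back) governed by $J$ and an odd mode (front = −back) governed by $J - 2K$. In the continuous PDE on a ring, a perturbation $\propto \cos(qx)$ evolves with $J - \mathcal{D}q^2$ for a diagonal diffusion matrix $\mathcal{D}$, so the odd compartment mode plays the role of the first antisymmetric spatial mode, with $2K$ standing in for $\mathcal{D}q^2$. A polarizing instability appears when $J - 2K$ acquires an eigenvalue with positive real part while $J$ stays stable.

```python
import cmath


def eig2(m):
    """Eigenvalues of the 2x2 matrix [[a, b], [c, d]] from its trace and determinant."""
    (a, b), (c, d) = m
    tr, det = a + d, a * d - b * c
    disc = cmath.sqrt(tr * tr - 4.0 * det)
    return (tr + disc) / 2.0, (tr - disc) / 2.0


def matvec(m, v):
    """Plain matrix-vector product."""
    return [sum(m[r][c] * v[c] for c in range(len(v))) for r in range(len(m))]


def full_jacobian(j, k):
    """4x4 Jacobian for the state (u_front, v_front, u_back, v_back).

    Each compartment has local Jacobian j; species s is exchanged between the
    compartments at rate k[s], i.e. d(front)/dt gains k * (back - front).
    """
    jac = [[0.0] * 4 for _ in range(4)]
    for r in range(2):
        for c in range(2):
            exch = k[r] if r == c else 0.0
            jac[r][c] = jac[r + 2][c + 2] = j[r][c] - exch  # local kinetics minus outflow
            jac[r][c + 2] = jac[r + 2][c] = exch            # inflow from the other compartment
    return jac


def odd_mode_matrix(j, k):
    """Jacobian restricted to antisymmetric perturbations (front = -back): J - 2K."""
    return [[j[r][c] - (2.0 * k[r] if r == c else 0.0) for c in range(2)] for r in range(2)]


if __name__ == "__main__":
    J = [[1.0, -1.0], [2.0, -1.5]]  # hypothetical local Jacobian, stable on its own
    K = (0.05, 1.0)                 # hypothetical exchange: species 1 slow, species 2 fast
    even = max(e.real for e in eig2(J))
    odd = max(e.real for e in eig2(odd_mode_matrix(J, K)))
    print(f"max Re(lambda): even mode {even:.3f}, odd mode {odd:.3f}")
```

With these illustrative numbers, the even mode stays stable while the odd mode has a positive eigenvalue. Checking the odd mode of the two-compartment system is therefore a direct proxy for the unstable odd spatial eigenmode of the PDE.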

      The schematic of the bifurcation in Figure 1A (now Figure 1 – figure supplement 1A) and the numerical bifurcation analysis of the EGFR model in Figure 1 – figure supplement 1C both represent a subcritical pitchfork bifurcation, but the alignment of the IHSS branches is slightly different in the EGFR model. This, however, has no influence on the full dynamics of the system or on the proposed hypothesis. Moreover, to explain the dynamical transitions in detail (how the unfolding of the PB results in robust polarization, and how organization at criticality enables temporal memory of polarization), we included a revised schematic in Figure 1 – figure supplement 1A. This schematic shows the signal-induced transitions previously depicted in compact form in Figure 1A, and a corresponding description has been added to Methods, Section 5.15. The corresponding transitions for the one-dimensional projection of the EGFR model are also included in the detailed response (Figure 2) for comparison.

      Relationship between the measurements and model

      The second main strength of this work is the contribution of controlled measurements of cell motility, polarization, and phosphorylated EGFR profiles. The measurements of cell migration presented here support the claim that the cells have a memory of past gradients. Additionally, the authors contribute very nice quantifications of the memory timescale. The Lapatinib experiments also support the claim that this memory is related to EGFR activity. However, there are a number of ways in which the real cells appear not to behave like the in silico cells. Polarization in phosphorylated EGFR is present only some of the time in the data, and if present, appears to be weak and/or variable, in magnitude and direction (phosphorylated EGFR profiles, figure 2C, Figure 2-Figure supplement 1D, E). Even for the subset of cells that display polarized EGFR phosphorylation profiles, the average profile is shown after aligning to the peak for each cell (Figure 2-Figure Supplement 1C), so it is not clear that they polarize in the direction of the gradient.

      We thank the referee for these comments which we used as a basis to improve the presentation of the results in the amended version of the manuscript. In order to demonstrate that cells polarize in the direction of the maximal EGF concentration, we have used the EGF647 intensity to quantify the growth factor distribution around each cell and calculated the angle between the maximum of the EGF647 distribution and projection of EGFRp spatial distribution (summarized in Figure 2 – figure supplement 1F and Methods). In brief, for quantification of EGF647 distribution outside each cell, the cell masks were extended by 23 pixels, and the outer rim of 15 pixels was used for the quantification. A radial histogram of the obtained angles confirms that the polarization of EGFRp is in the direction of maximal EGF647, with the variability arising from the positioning of the cells within the gradient chamber. That cells polarize in direction of the gradient can be indirectly inferred also from the migration data (Fig. 3C), where we have estimated the projection of the relative displacement angles with respect to the gradient direction. The cos 𝜃 values during and for ~50min after gradient removal are maintained around 1 (cells migrate in direction of the gradient), before re-setting to 0, which is characteristic for the no-stimulus case.
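      The two directional read-outs described above, the polarization angle relative to the EGF647 maximum and cos θ of the displacement relative to the gradient, can be sketched as follows. The assumptions are ours, not taken from the Methods: the EGFRp polarization direction is taken as the resultant vector of an angular intensity profile sampled around the cell perimeter, the EGF647 direction is taken as the angle of its maximum, and the gradient points along +y. The profiles are synthetic.

```python
import math


def resultant_angle(angles_rad, intensities):
    """Direction (rad) of the intensity-weighted resultant vector of an angular profile."""
    sx = sum(w * math.cos(a) for a, w in zip(angles_rad, intensities))
    sy = sum(w * math.sin(a) for a, w in zip(angles_rad, intensities))
    return math.atan2(sy, sx)


def peak_angle(angles_rad, intensities):
    """Angle at which the profile is maximal."""
    return max(zip(intensities, angles_rad))[1]


def wrapped_difference(a, b):
    """Signed angle a - b wrapped to [-pi, pi)."""
    return (a - b + math.pi) % (2.0 * math.pi) - math.pi


def cos_to_gradient(dx, dy, grad=(0.0, 1.0)):
    """cos(theta) between a displacement (dx, dy) and the gradient direction."""
    norm = math.hypot(dx, dy) * math.hypot(*grad)
    return (dx * grad[0] + dy * grad[1]) / norm


if __name__ == "__main__":
    angles = [2.0 * math.pi * i / 72 for i in range(72)]  # 5-degree sampling around the cell
    egf = [1.0 + math.cos(a - math.radians(90)) for a in angles]          # synthetic EGF647 rim
    egfrp = [1.0 + 0.5 * math.cos(a - math.radians(100)) for a in angles]  # synthetic EGFRp rim
    offset = wrapped_difference(resultant_angle(angles, egfrp), peak_angle(angles, egf))
    print(f"EGFRp vs EGF647 maximum: {math.degrees(offset):.1f} deg")
    print(f"cos(theta) for a step (0.5, 2.0): {cos_to_gradient(0.5, 2.0):.3f}")
```

Collecting the wrapped offsets across cells gives the radial histogram described above. Averaging cos θ over cells at each time point gives a curve that stays near 1 while cells follow the gradient and relaxes toward 0 without a stimulus.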

      The length of the memory in EGFRp polarization is indeed variable in single cells, averaging ~40-50 min. The length of the memory is directly related to the total EGFR concentration on the plasma membrane: the closer EGFRt is to the value at which the SNPB occurs, the longer the memory lasts, and in theory

$\text{Memory duration} \propto (EGFR_{t,SN} - EGFR_t)^{-1/2}$.

From the experimental measurements, we have indeed observed a correlation between these two quantities, which we include here for the referee’s perusal (Figure 1). However, this dependency could not be fitted directly to the experimental data, for the following reason. In general, the fitting function is $f(EGFR_T) = c\,(c_{EGFR_T,SN} - EGFR_T)^{-n}$, where c is a constant and $c_{EGFR_T,SN}$ is the total EGFR concentration at the plasma membrane that marks the position of the SNPB. This value, however, cannot be identified with certainty from the experiments. We therefore chose a fixed value based on the spread of the data. In this case, the fit yielded n = 0.49, which approximates the theoretical value of 1/2 well. However, since one of the parameters must be chosen arbitrarily, we refrain from presenting the fit.

      *Figure 1: Correlation between single-cell transient memory duration and plasma membrane abundance of EGFR-mCitrine.*
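      The sensitivity of the fitted exponent to the assumed bifurcation point, which is the reason given above for not presenting the fit, can be illustrated with a log-log regression. The sketch assumes the ghost scaling $T = c\,(c_{SN} - EGFR_T)^{-n}$ and generates synthetic, noise-free durations with n = 1/2 and a hypothetical $c_{SN}$. It then refits the exponent using the correct $c_{SN}$ and using mis-specified values. Units and values are arbitrary.

```python
import math


def fit_exponent(egfr_t, durations, c_sn):
    """Return n from a least-squares fit of ln(T) = ln(c) - n * ln(c_SN - EGFR_T)."""
    xs = [math.log(c_sn - e) for e in egfr_t]
    ys = [math.log(t) for t in durations]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return -slope


if __name__ == "__main__":
    c_sn_true = 1.0  # hypothetical position of the saddle-node (arbitrary units)
    egfr_t = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
    durations = [20.0 * (c_sn_true - e) ** -0.5 for e in egfr_t]  # synthetic, noise-free
    for c_sn in (1.0, 1.1, 1.3):
        print(f"assumed c_SN = {c_sn}: fitted n = {fit_exponent(egfr_t, durations, c_sn):.2f}")
```

Only the correct $c_{SN}$ recovers n = 0.5. Shifting the assumed bifurcation point changes the fitted exponent even for noise-free data, so an exponent obtained with a hand-picked $c_{SN}$ is only indicative.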

      The real cells also appear to track the gradient far less reliably than the in silico cells (e.g. Figure 4B vs. 4C). Thus the measurements demonstrate and quantify the phenomenon of directional memory, but it is not clear that they support the mechanism proposed by the model, i.e. a symmetry-breaking transition in phosphorylated EGFR.

      We would like to emphasize that the symmetry-breaking transition via a subcritical pitchfork bifurcation gives rise to robust polarization in the direction of the growth factor signal. Critical organization at the SNPB, in turn, gives rise to temporal memory of the polarized state, as well as the capability to integrate signals that change over both time and space. Both the analytical and the numerical analysis of the experimentally identified EGFR network verify that this network exhibits a subcritical PB. In the amended version of the manuscript, we have also included quantification of the directionality of polarization (Figure 2 – figure supplement 1F).

      We would like to note, however, that the difference between the simulations and the experiments in Figure 4 arises from how migration is represented. Owing to the complexity of connecting the signaling model with the physical model of the cell, directional migration in the simulations is realized as ballistic movement, whereas experimentally we have found that cells perform a persistent biased random walk (Figure 3D). In the amended version of the manuscript, we discuss these differences in relation to Fig. 4.
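      The effect of the two motility modes on a direction read-out can be sketched with a toy heading model. This is an assumption made for illustration, not the cell model used in the study. A ballistic cell keeps its heading aligned with the gradient. In the persistent biased random walk, the heading carries over from step to step, relaxes toward the gradient direction at a hypothetical rate `bias`, and receives Gaussian angular noise of hypothetical size `noise`. The mean cos θ of the steps relative to the gradient (θ = 0) is then exactly 1 for ballistic motion and clearly below 1 for the biased random walk, even though both drift up the gradient.

```python
import math
import random


def mean_cos_theta(n_steps, bias, noise, rng):
    """Mean cos(theta) of unit steps whose heading relaxes toward the gradient (theta = 0).

    bias:  fraction of the heading error corrected per step (hypothetical)
    noise: standard deviation of the angular kick per step in radians (hypothetical)
    """
    theta, total = 0.0, 0.0
    for _ in range(n_steps):
        theta += -bias * math.sin(theta) + rng.gauss(0.0, noise)  # persistence + bias + noise
        total += math.cos(theta)
    return total / n_steps


if __name__ == "__main__":
    print("ballistic:", mean_cos_theta(5000, bias=0.0, noise=0.0, rng=random.Random(42)))
    pbrw = mean_cos_theta(5000, bias=0.1, noise=0.4, rng=random.Random(42))
    print(f"persistent biased random walk: {pbrw:.2f}")
```

The gap between the two values reflects the motility mode alone, independent of the signaling model, which is consistent with the explanation above for why measured tracks follow the gradient less tightly than the simulated ones.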

      Moreover, in the experiments, the EGF647 gradient is established from the top of the microfluidic chamber, and therefore there will be variability due to the position of cells within the chamber, the disruption of the gradient due to the presence of neighboring cells etc. The single cell trajectories (several examples shown in Figure 4 – figure supplement 1F) and the quantification of the relative displacement angles (Figure 4D,E) however clearly depict that cells migrate in the gradient direction and rapidly adapt to the changes in the external cues.

      Additionally, in the authors' model, the features of memory and adaptability in cell navigation depend on the system being poised near a critical point. Thus, in silico, the sensing system 'breaks' when the system parameters are moved away from this point. In particular, cells with increased receptor concentration on their surface cannot adapt to new gradient directions (Section 1, final paragraph; Figure 1-Figure Supplement 1E-G). Based on this, the authors' theoretical framework makes a nonintuitive prediction: overexpression of the surface receptor EGFR in real cells should render them insensitive to changes in the concentration gradient. The fact that the model suggests a surprising, testable prediction is a strength of the framework. A weakness is that the consistency of this prediction with empirical data is not discussed (though the authors note similarities between this regime and unrealistic features of previous models).

      The organization at criticality indeed depends on the total concentration of receptors at the plasma membrane. The trafficking of epidermal growth factor receptors has previously been characterized in detail. These studies demonstrated that ligandless receptors continuously recycle to the plasma membrane, whereas ligand-bound receptors are unidirectionally removed and trafficked to the lysosome, where they await degradation [5]. Thus, how quickly the system moves away from criticality depends directly on the dose and duration of the EGF stimulus, as this is directly proportional to the fraction of liganded receptors. The subsequent resetting of the system to criticality depends on the time scale for the biosynthesis of new receptors [17].

      Overexpression of EGFR will cause the system to display either permanent polarization (organization in the stable IHSS state) or uniform activation (high HSS branch). We have tested numerically the features of the system when it displays permanent memory (Figure 4 – figure supplement 1C,D) and demonstrated that, in this case, cells are not able to resolve signals from opposite directions, so migration will be halted. We have now also tested numerically the capability of cells to resolve simultaneous signals of different amplitudes from opposite directions. These simulations demonstrate that permanent memory resulting from this receptor organization hinders the cells in this comparison task, in contrast to organization at criticality (Figure 4 – figure supplement 2). In the amended version of the manuscript, we have included a discussion of these points raised by the referee, and we hope that this allows for a clearer presentation of our findings and their implications.

    1. Author Response

      Reviewer #1 (Public Review):

      The authors set out to extend modeling of bispecific engager pharmacology through explicit modelling of the search of T cells for tumour cells, the formation of an immunological synapse and the dissociation of the immunological synapse to enable serial killing. These features have not been included in prior models and their incorporation may improve the predictive value of the model.

      Thank you for the positive feedback.

      The model provides a number of predictions that are of potential interest- that loss of CD19, the target antigen, to 1/20th of its initial expression will lead to escape and that the bone marrow is a site where the tumour cells may have the best opportunity to develop loss variants due to the limited pressure from T cells.

      Thank you for the positive feedback.

      A limitation of the model is that adhesion is only treated as a 2D implementation of the blinatumomab mediated bridge between T cell and B cells- there is no distinct parameter related to the distinct adhesion systems that are critical for immunological synapse formation. For example, CD58 loss from tumours is correlated with escape, but it is not related to the target, CD19. While they begin to consider the immunological synapse, they don't incorporate adhesion as distinct from the engager, which is almost certainly important.

      We agree that adhesion molecules play critical roles in cell-cell interaction. In our model, we assumed that these adhesion molecules are constant (i.e., that they do not differ across cell populations). This assumption allowed us to focus on the BiTE-mediated interactions.

      Revision: To clarify this point, we added a couple of sentences in the manuscript.

      “Adhesion molecules such as CD2-CD58, integrins and selectins, are critical for cell-cell interaction. The model did not consider specific roles played by these adhesion molecules, which were assumed constant across cell populations. The model performed well under this simplifying assumption”.

      In addition, we acknowledged the fact that “synapse formation is a set of precisely orchestrated molecular and cellular interactions. Our model merely investigated the components relevant to BiTE pharmacologic action and can only serve as a simplified representation of this process”.

      While the random search is a good first approximation, T cell behaviour is actually guided by stroma and extracellular matrix, which are non-isotropic. In a lymphoid tissue, the stroma is optimised for a search that can be approximated as Brownian or, more accurately, as a correlated random walk, but in other tissues, particularly tumours, the Brownian search is not a good approximation and other models have been applied. It would be interesting to look at observations from bone marrow or other sites to determine the best approximation for the search related to BiTE targets.

      We agree that tissue stromal factors greatly influence T cell search strategies. Our current model treats Brownian motion as a good first approximation for two reasons: 1) we define tissues as homogeneous compartments to obtain unbiased evaluations of the factors that influence BiTE-mediated cell-cell interaction, such as T cell infiltration, T:B ratio, and target expression. Stromal factors were not considered in the model because representing their gradients requires spatially resolved tissue compartments; 2) our model was primarily calibrated against in vitro data obtained from a “well-mixed” system that does not recapitulate tissue stromal factors. We did not obtain tissue-specific data to support predictions of T cell movement; this is under current investigation in our lab. Therefore, we are cautious about assuming different patterns of T cell movement when translating the model to in vivo settings. We acknowledge the limitation that our model does not consider more physiologically relevant T cell search strategies.

      Revision: In the Discussion, we added a limitation of our model: “We assumed Brownian motion in the model as a good first approximation of T cell movement. However, T cells often adopt other, more physiologically relevant search strategies that are closely associated with stromal factors. Because of these stromal factors, cell-cell encounter probabilities would differ across anatomical sites.”
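      To make the Brownian-versus-persistent distinction concrete, the toy sketch below compares the mean squared displacement (MSD) of 2D walkers that draw a fresh random heading at every step (an uncorrelated random walk, the discrete analogue of Brownian motion) with walkers whose heading drifts by a small Gaussian turning angle (a correlated random walk). It is an illustration only, not the agent-based model described here; the unit step length, the 0.5 rad turning-angle SD, and the walker and step counts are arbitrary assumptions. For fixed step length l and symmetric turning angles with mean cosine c, the correlated walk's MSD approaches n*l^2*(1+c)/(1-c) for large n, compared with n*l^2 for the uncorrelated walk, so persistence alone can change how much territory a T cell covers and therefore its encounter probabilities.

```python
import math
import random


def mean_squared_displacement(n_walkers, n_steps, turn_sd=None, step=1.0, seed=0):
    """Average squared distance from the origin after n_steps for 2D walkers.

    turn_sd=None: a new heading is drawn uniformly at every step
        (uncorrelated random walk, a discrete stand-in for Brownian motion).
    turn_sd=float: the heading changes by a Gaussian turning angle with this SD
        in radians (correlated random walk; the value is an arbitrary assumption).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_walkers):
        x = y = 0.0
        heading = rng.uniform(0.0, 2.0 * math.pi)
        for _ in range(n_steps):
            if turn_sd is None:
                heading = rng.uniform(0.0, 2.0 * math.pi)
            else:
                heading += rng.gauss(0.0, turn_sd)
            x += step * math.cos(heading)
            y += step * math.sin(heading)
        total += x * x + y * y
    return total / n_walkers


brownian = mean_squared_displacement(500, 200)
persistent = mean_squared_displacement(500, 200, turn_sd=0.5)
print(f"uncorrelated MSD: {brownian:.0f}, correlated MSD: {persistent:.0f}")
```

      With c = exp(-0.5^2/2), roughly 0.88, the persistent walkers should spread about an order of magnitude farther in squared distance; tissue-specific turning statistics would be needed to decide which description fits T cells in bone marrow or tumours.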

      Reviewer #3 (Public Review):

      Liu et al. combined mechanistic modeling with in vitro experiments and data from a clinical trial to develop an in silico model describing the response of T cells against tumor cells when bispecific T cell engager (BiTE) antibodies, a standard immunotherapeutic drug, are introduced into the system. The model predicted responses of T cell and target cell populations in vitro and in vivo in the presence of BiTEs, linking molecular-level interactions between BiTE molecules, CD3 receptors, and CD19 receptors to the population kinetics of the tumor and the T cells. Furthermore, the model predicted tumor killing kinetics in patients and offered suggestions for optimal dosing strategies in patients undergoing BiTE immunotherapy. The conclusions drawn from this combined approach are interesting and are supported reasonably well by experiments and modeling. However, the conclusions can be tightened further by making some moderate to minor changes in their approach. In addition, there are several limitations in the model which deserve some discussion.

      Strengths

      A major strength of this work is the ability of the model to integrate processes from the molecular scale to the populations of T cells, target cells, and BiTE antibodies across different organs. A model of this scope has to contain many approximations, and thus the model should be validated with experiments. The authors did an excellent job of comparing the basic and in vitro aspects of their approach with in vitro data, where they compared the numbers of target cells engaged with T cells as the number of BiTE molecules, the ratio of effector to target cells, and the expression levels of the CD3 and CD19 receptors were varied. The agreement of the model with the data was excellent in most cases, which led to several mechanistic conclusions. In particular, the study found that target cells with lower CD19 expression escape T cell killing.

      The in vivo extension of the model showed reasonable agreement with the kinetics of B cell populations in patients, where the data were obtained from a published clinical trial. The model explained differences in B cell population kinetics between responders and non-responders and found that these differences were driven by differences in T cell numbers between the groups. The ability of the model to describe the in vivo kinetics is promising. In addition, the model leads to some interesting conclusions, e.g., that the bone marrow harbors tumor growth during BiTE treatment. The authors then used the model to propose an alternative dosage scheme for BiTEs that requires a smaller dose of the drug.

      Thank you for the positive comments.

      Weaknesses

      There are several weaknesses in the development of the model. Multiscale models of this nature contain parameters that need to be estimated by fitting the model to data. Some of these parameters are associated with model approximations or are not measured in experiments. Thus, a common practice is to estimate parameters with some 'training data' and then test model predictions using 'test data'. Though Supplementary file 1 provides values for some of the parameters that appear to have been estimated, it was not clear which datasets were used for training and which for testing. The confidence intervals of the estimated parameters and the sensitivity of the proposed in vivo dosage schemes to parameter variations were also unclear.

      We agree with the reviewer on the model validation.

      Revision: To ensure reproducibility, we summarized model assumptions and parameter values/sources in Supplementary file 1. To mimic tumor heterogeneity and the evolution process, we applied stochastic agent-based models, which are challenging to optimize globally against the data. The majority of key parameters were obtained or derived from the literature. Details have been provided in the response to Reviewer 3 - Question 1. In our modeling process, we manually optimized the sensitive coefficient (β) for the base model using pilot in vitro data, and the sensitive coefficient (β) for the in vivo model by re-calibrating against the in vitro data at a low BiTE concentration, since BiTE concentrations in patients (mostly < 2 ng/ml) are relevant only to the lower bound of the concentration range we investigated in vitro (0.65-2000 ng/ml). We have added some clarification of the limitations of this approach in the text (details are provided in the response to the following question). We understand the concerns, but the agent-based nature of the model prevents us from performing global optimization.

      The model appears to show a few unreasonable behaviors and does not agree with experiments in several cases, which could point to missing mechanisms in the model. Here are some examples. The model shows a surprising decrease in T cell-target cell synapse formation when the affinity of the BiTEs for CD3 was increased; the opposite would have been more intuitive. The authors suggest that degradation of CD3 could be a reason for this behavior. However, this could probably be tested easily by removing CD3 degradation from the model. Another example is that the increase in the % of engaged effector cells in the model with increasing CD3 expression does not agree well with experiments (Fig. 3d), whereas a similar fold increase in the % of engaged effector cells in the model agrees better with experiments for increasing CD19 expression (Fig. 3e). It is unclear how this can be explained, given that CD3 and CD19 appear to be present in similar copy numbers per cell (~10⁴ molecules/cell) and both receptors bind the BiTE with high affinities (e.g., koff < 10⁻⁴ s⁻¹).

      Thank you for pointing this out. The bidirectional effect of CD3 affinity on IS formation is counterintuitive. In a hypothetical situation when there is no CD3 downregulation, the bidirectional effect disappears (as shown below), consistent with our view that CD3 downregulation accounts for the counterintuitive behavior. We have included the simulation to support our point. From a conceptual standpoint, the inclusion of CD3 degradation means the way to maximize synapse formation is for the BiTE to first bind tumor antigen, after which the tumor-BiTE complex “recruits” a T cell through the CD3 arm.

      We agree that the model did not adequately capture the effect of CD3 expression at the highest BiTE concentration, 100 ng/ml, while the effects at other BiTE concentrations were well captured (as shown below, left). The model predicted a much more moderate effect of CD3 expression on IS formation at the highest concentration. This is partly because the model assumed rapid CD3 downregulation upon antibody engagement. We ran a similar simulation to the one above, with moderate CD3 downregulation (as shown below, right). This increases the effect of CD3 expression at the highest BiTE concentration, consistent with experiments. Interestingly, a rapid CD3 downregulation rate, as we concluded, is required to capture the data profiles under all other conditions. Because 100 ng/ml is much higher than the therapeutically relevant BiTE level in circulation (< 2 ng/ml), we did not investigate the mechanism underlying this inconsistent model prediction, but we acknowledged that the model under-predicted IS formation in Figure 3d. Notably, this discrepancy is unlikely to affect our clinical predictions, as CD3 expression is low and blood BiTE concentrations are very low (< 2 ng/ml).

      Revision: We have adjusted the text to increase clarity on these points. In addition, we added: “The base model underpredicted the effect of CD3 expression on IS formation at 100 ng/ml BiTE concentration, which is partially because of the rapid CD3 downregulation upon BiTE engagement and assay variation across experimental conditions.”

      The model does not include signaling and activation of T cells as they form the immunological synapse (IS) with target cells. Formation of the IS leads to aggregation of different receptors, adhesion molecules, and kinases, which modulate signaling and activation. Thus, it is likely that variations in the copy numbers of CD3 and of CD19-BiTE-CD3 complexes will lead to variations in cytotoxic responses, and presumably in CD3 degradation as well. Perhaps some of these missing processes are responsible for the disagreements between the model and the data shown in Fig. 3. In addition, the in vivo model does not contain any development of the T cells as they are stimulated by the BiTEs. Differences in T cell development, such as the generation of dysfunctional/exhausted T cells, could lead to differences in responses to BiTEs in patients. In particular, the in vivo model does not agree with the kinetics of B cells after day 29 in non-responders (Fig. 6d); could the kinetics of T cell development play a role in this?

      We agree that intracellular signaling is critical to T cell activation and cytotoxic effects. IS formation, T cell activation, and cytotoxicity are a cascade of events with highly coordinated molecular and cellular interactions. Compared with T cell activation and cytotoxicity, IS formation occurs relatively early. As shown in our study, IS formation can occur within 2-5 min, while the other events often take hours to be observed. We found that IS formation is primarily driven by two intercellular processes: cell-cell encounter and cell-cell adhesion. Intracellular signaling would be initiated during cell-cell adhesion or at the late stage of IS formation. We think these intracellular events are relevant but may not be the reason why our model did not adequately capture the profiles in Figure 3d at the highest BiTE concentration. Therefore, we did not include intracellular signaling in the models. Another reason is that we simulated our models at the agent level to mimic tumor evolution, which is computationally demanding; simulating intracellular events for each cell would make this even more challenging computationally.

      T cell activation and exhaustion throughout BiTE treatment are very complicated, time-variant, and impacted by multiple factors such as T cell status, tumor burden, BiTE concentration, immune checkpoints, and the tumor environment. T cell proliferation and death rates are challenging to estimate, as their quantitative relationships with those factors are unknown. Therefore, T cell abundance (expansion) was treated as an independent variable in our model. T cell counts are measured in BiTE clinical trials, and we included these data in our model to represent the expanded T cell population. Patients with high T cell expansion are often those with better clinical responses. Notably, the T cell decline due to rapid redistribution after administration was excluded from the model. T cell abundance was included in the simulations in Figure 6 but not in the proof-of-concept simulations in Figure 7.

      In Figure 6d, the kinetics of T cell abundance had been included in the simulations for responders and non-responders in the MT103-211 study. Thus, the kinetics of T cell development cannot explain the disagreement between model prediction and observation after day 29 in non-responders. The observed data are median values of B cell kinetics in non-responders (N = 27) with very large inter-subject variation (baseline from 10 to 10000/μL), which makes them very challenging for the model to capture perfectly. Many non-responders with severe progression dropped out of treatment at the end of cycle 1, which resulted in a “more potent” efficacy in the 2nd cycle. This might be the main reason for the disagreement.
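      The dropout argument can be illustrated with a few lines of arithmetic: if the patients with the highest B cell counts leave the study, the median of those who remain falls even when no individual patient improves. The counts below are hypothetical values chosen only to show the effect; they are not MT103-211 data.

```python
from statistics import median

# Hypothetical B cell counts (cells/uL) for seven non-responders at the end of
# cycle 1; illustrative numbers only, not clinical trial data.
cycle1 = [10, 50, 100, 300, 1000, 5000, 10000]

# Suppose the two patients with the most severe progression (highest counts)
# discontinue, while every remaining patient's count stays exactly the same.
cycle2 = sorted(cycle1)[:-2]

print(median(cycle1), median(cycle2))  # 300 100
```

      The median drops threefold purely through informative dropout, which is the kind of apparent cycle-2 "potency" a model without dropout would not reproduce.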

      Variation in cytotoxic response was not included in our models. Tumor cells were assumed to be eradicated after engagement with effector cells; no killing rate or killing probability was implemented. This assumption reduced model complexity and aligned well with our in vitro and clinical data. Cytotoxic response in vivo is impacted by multiple factors such as the copy number of CD3, cytokine/chemokine release, the tumor microenvironment, and T cell activation/exhaustion. For example, the cytotoxic responses and killing rates mediated by the 1:1 synapse (ET) and other variants (ETE, TET, ETEE, etc.) are likely to differ as well. Our model did not differentiate the killing rates of these synapse variants, but it has quantified these variants, providing a framework for us to address these questions in the future. We agree that differentiating cytotoxic responses across scenarios may improve model predictions, and more exploration is needed in the future.

      Revision: We added a discussion of the limitations which we believe is informative to future studies.

      “Our models did not include intracellular signaling processes, which are critical for T cell activation and cytotoxicity. However, our data suggest that encounter and adhesion are more relevant to initial IS formation. To make more clinically relevant predictions, the models should consider the intracellular signaling events that drive T cell activation and cytotoxic effects. Of note, we did consider T cell expansion dynamics in organs as an independent variable during treatment for the simulations in Figure 6. T cell expansion in our model is case-specific and time-varying.”

      References:

      Chen W, Yang F, Wang C, Narula J, Pascua E, Ni I, Ding S, Deng X, Chu ML, Pham A, Jiang X, Lindquist KC, Doonan PJ, Blarcom TV, Yeung YA, Chaparro-Riggers J. 2021. One size does not fit all: navigating the multi-dimensional space to optimize T-cell engaging protein therapeutics. MAbs 13:1871171. DOI: 10.1080/19420862.2020.1871171, PMID: 33557687

      Dang K, Castello G, Clarke SC, Li Y, Balasubramani A, Boudreau A, Davison L, Harris KE, Pham D, Sankaran P, Ugamraj HS, Deng R, Kwek S, Starzinski A, Iyer S, Schooten WV, Schellenberger U, Sun W, Trinklein ND, Buelow R, Buelow B, Fong L, Dalvi P. 2021. Attenuating CD3 affinity in a PSMAxCD3 bispecific antibody enables killing of prostate tumor cells with reduced cytokine release. Journal for ImmunoTherapy of Cancer 9:e002488. DOI: 10.1136/jitc-2021-002488, PMID: 34088740

      Gong C, Anders RA, Zhu Q, Taube JM, Green B, Cheng W, Bartelink IH, Vicini P, Wang B, Popel AS. 2019. Quantitative Characterization of CD8+ T Cell Clustering and Spatial Heterogeneity in Solid Tumors. Frontiers in Oncology 8:649. DOI: 10.3389/fonc.2018.00649, PMID: 30666298

      Mejstríková E, Hrusak O, Borowitz MJ, Whitlock JA, Brethon B, Trippett TM, Zugmaier G, Gore L, Stackelberg AV, Locatelli F. 2017. CD19-negative relapse of pediatric B-cell precursor acute lymphoblastic leukemia following blinatumomab treatment. Blood Cancer Journal 7:659. DOI: 10.1038/s41408-017-0023-x, PMID: 29259173

      Samur MK, Fulciniti M, Samur AA, Bazarbachi AH, Tai YT, Prabhala R, Alonso A, Sperling AS, Campbell T, Petrocca F, Hege K, Kaiser S, Loiseau HA, Anderson KC, Munshi NC. 2021. Biallelic loss of BCMA as a resistance mechanism to CAR T cell therapy in a patient with multiple myeloma. Nature Communications 12:868. DOI: 10.1038/s41467-021-21177-5, PMID: 33558511

      Xu X, Sun Q, Liang X, Chen Z, Zhang X, Zhou X, Li M, Tu H, Liu Y, Tu S, Li Y. 2019. Mechanisms of relapse after CD19 CAR T-cell therapy for acute lymphoblastic leukemia and its prevention and treatment strategies. Frontiers in Immunology 10:2664. DOI: 10.3389/fimmu.2019.02664, PMID: 31798590

      Yoneyama T, Kim MS, Piatkov K, Wang H, Zhu AZX. 2022. Leveraging a physiologically-based quantitative translational modeling platform for designing B cell maturation antigen-targeting bispecific T cell engagers for treatment of multiple myeloma. PLOS Computational Biology 18:e1009715. DOI: 10.1371/journal.pcbi.1009715, PMID: 35839267

    1. Author Response

      Reviewer #1 (Public Review):

      In this manuscript, the authors present a new technique for analysing low complexity regions (LCRs) in proteins- extended stretches of amino acids made up from a small number of distinct residue types. They validate their new approach against a single protein, compare this technique to existing methods, and go on to apply this to the proteomes of several model systems. In this work, they aim to show links between specific LCRs and biological function and subcellular location, and then study conservation in LCRs amongst higher species.

      The new method presented is straightforward and clearly described, generating comparable results with existing techniques. The technique can be easily applied to new problems and the authors have made code available.

      This paper is less successful in drawing links between the results and their biological importance. The introduction does not clearly position this work in the context of previous literature, uses relatively specialised technical terms without defining them, and leaves the reader unclear about how the results have advanced the field. In terms of their results, the authors further propose interesting links between LCRs and function. However, their analyses for these most exciting results rely heavily on UMAP visualisation and on tests with apparently small effect sizes. This is a weakness throughout the paper and reduces the support for strong conclusions.

      We appreciate the reviewer’s comments on our manuscript. To address comments about the clarity of the introduction and the position of our findings with respect to the rest of the field, we have made several changes to the text. We have reworked the introduction to provide a clearer view of the current state of the LCR field, and our goals for this manuscript. We also have made several changes to the beginnings and ends of several sections in the Results to explicitly state how each section and its findings help advance the goal we describe in the introduction, and the field more generally. We hope that these changes help make the flow of the paper more clear to the reader, and provide a clear connection between our work and the field.

      We address comments about the use of UMAPs and statistical tests in our responses to the specific comments below.

      Additionally, whilst the experimental work is interesting and concerns LCRs, it does not clearly fit into the rest of the body of work, focused as it is on a single protein and the importance of its LCRs. It arguably serves as a validation of the method, but if that is the authors' intention, it needs to be stated more clearly, as it appears orthogonal to the overall drive of the paper.

      In response to this comment, we have made the rationale for choosing this protein more explicit at the beginning of this section, and clarified the role that these experiments play in the overall flow of the paper.

      Our intention with the experiments in Figure 2 was to highlight the utility of our approach in understanding how LCR type and copy number influence protein function. Understanding how LCR type and copy number can influence protein function is clearly outlined as a goal of the paper in the Introduction.

      In the text corresponding to Figure 2, we hypothesize how different LCR relationships may inform the function of the proteins that have them, and how each group in Figure 2A/B can be used to test these hypotheses. The global view provided by our method allows proteins to be selected on the basis of their LCR type and copy number for further study.

      To demonstrate the utility of this view, we select a key nucleolar protein with multiple copies of the same LCR type (RPA43, a subunit of RNA Pol I), and learn important features driving its higher-order assembly in vivo and in vitro. We learned that in vivo, at least two copies of RPA43’s K-rich LCRs are required for nucleolar integration, and that these K-rich LCRs are also necessary for in vitro phase separation.

      Despite this protein being a single example, we were able to gain important insights about how K-rich LCR copy number affects protein function, and that both in vitro higher order assembly and in vivo nucleolar integration can be explained by LCR copy number. We believe this opens the door to ask further questions about LCR type and copy number for other proteins using this line of reasoning.

      Overall, I think the ideas presented in the work are interesting and the method is sound, but the data do not clearly support strong conclusions. The weakness in the conclusions and the poor description of the wider background lead me to question the impact of this work on the broader field.

      For all the points where Reviewer #1 comments on the data and its conclusions, we provide explanations and additional analyses in our responses below showing that the data do indeed support our conclusions. In regards to our description of the wider background, we have reworked our introduction to more clearly link our work to the broader field, such that a more general audience can appreciate the impact of our work.

      Technical weaknesses

      In testing the dotplot-based method, the manuscript presents an FDR based on a comparison between real proteome data and a null proteome. This is a sensible approach, but their choice of a uniform random distribution would be expected to mislead. This is because if the real distribution is non-uniform, stretches of the most frequent amino acid will occur more frequently than under the uniform distribution.

      Thank you for pointing this out. The choice of null proteome was a topic of much discussion between the authors as this work was being performed. While we maintain that the uniform background is the most appropriate, the question from this reviewer and the other reviewers made us realize that a thorough explanation was warranted. For a complete explanation for our choice of this uniform null model, please see the newly added appendix section, Appendix 1.

      The authors would also like to point out that the original SEG algorithm (Wootton and Federhen, 1993) also made the intentional choice of using a uniform background model.
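      The reviewer's concern can be quantified with a short calculation. If residues are drawn independently with frequencies p_a, the probability that a given window of k residues is a homopolymer is the sum of p_a^k over all 20 residues, which for k >= 2 is smallest when all frequencies are equal (by Jensen's inequality); any skew toward common residues therefore raises the expected number of homopolymer stretches relative to a uniform null. The sketch below assumes independent residues and uses a hypothetical skewed composition (one residue at 10%, the other 19 sharing the rest equally), not a measured proteome composition, and it does not reproduce the FDR procedure or the reasoning given in Appendix 1.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def homopolymer_window_prob(freqs, k):
    """P(a window of k independently drawn residues is one repeated residue) = sum_a p_a**k."""
    return sum(p ** k for p in freqs.values())


uniform = {aa: 1 / 20 for aa in AMINO_ACIDS}
# Hypothetical skewed composition (an assumption for illustration):
# L at 10%, the remaining 90% split evenly across the other 19 residues.
skewed = {aa: (0.10 if aa == "L" else 0.90 / 19) for aa in AMINO_ACIDS}

for k in (3, 5, 8):
    ratio = homopolymer_window_prob(skewed, k) / homopolymer_window_prob(uniform, k)
    print(f"k={k}: skewed/uniform homopolymer probability = {ratio:.2f}")
```

      The ratio grows with k because the most frequent residue dominates the sum, which is why the choice of background matters most for long, pure stretches.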

      More generally, I think the results suggest that dotplot generates results comparable to, not better than, existing methods; in the absence of an additional dataset that could serve as a "ground truth", the text would be more accurate if this conclusion were clearer.

      We did not intend to make any strong claims about the relative performance of our approach vs. existing methods with regard to the sequence entropy of the called LCRs beyond them being comparable, as this was not the main focus of our paper. To clarify the text such that it reflects this, we have removed ‘or better’ from the text in this section.

      The authors draw links between protein localisation/function and LCR content. This is done through UMAP visualisation and Wilcoxon rank-sum tests on amino acid frequency in different localisations. This is convincing in the case of ECM data, but the arguments are substantially less clear for other localisations/functions. The UMAP graphics generally show that the specific functions are sparsely spread. Moreover, given the sample size (in the context of the whole proteome), the p-value threshold obscures what appear to be relatively small effect sizes.

      We would first like to note that some of the amino acid frequency biases have been documented and experimentally validated by other groups, as we write and reference in the manuscript. Nonetheless, we have considered the reviewer's concerns, and upon rereading the section corresponding to Figure 3, we realize that our wording may have caused confusion in the interpretation there. In addition to clarifying this in the manuscript, we believe the following clarification may help in the interpretations drawn from that section.

      Each point in this analysis (and on the UMAP) is an LCR from a protein, and as such multiple LCRs from the same protein will appear as multiple points. This is particularly relevant for considering the interpretation of the functional/higher order assembly annotations because it is not expected that for a given protein, all of the LCRs will be directly relevant to the function/annotation. Just because proteins of an assembly are enriched for a given type of LCR does not mean that they only have that kind of LCR. In addition to the enriched LCR, they may or may not have other LCRs that play other roles.

      For example, a protein in the Nuclear Speckle may contain both an R/S-rich LCR and a Q-rich LCR. When looking at the Speckle, all of the LCRs of a protein are assigned this annotation, and so such a protein would contribute a point in the R/S region as well as elsewhere on the map. Because such "non-enriched" LCRs do not occur as frequently, and may not be relevant to Speckle function, they are sparsely spread.
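      The point-per-LCR bookkeeping described above can be sketched as follows. Each protein contributes one row per called LCR, and every row inherits all of the parent protein's annotations, so an annotation such as the nuclear speckle picks up enriched R/S-rich LCRs alongside scattered non-enriched ones. The protein names, annotations, and LCR type labels are hypothetical placeholders, and labeling each LCR with a single type string is a simplification of the composition vectors that would actually feed a UMAP.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Protein:
    name: str
    annotations: frozenset  # e.g. higher-order assembly terms (hypothetical)
    lcr_types: tuple        # one simplified type label per called LCR


def lcr_rows(proteins):
    """One row per (LCR, annotation): every LCR inherits all annotations of its protein."""
    return [
        (p.name, lcr_type, annotation)
        for p in proteins
        for lcr_type in p.lcr_types
        for annotation in sorted(p.annotations)
    ]


def lcr_types_for(rows, annotation):
    """Count LCR types among all rows carrying the given annotation."""
    return Counter(lcr_type for _, lcr_type, ann in rows if ann == annotation)


# Hypothetical proteins: all three have an R/S-rich LCR, two also carry other LCRs.
proteins = [
    Protein("PROT_A", frozenset({"speckle"}), ("RS", "Q")),
    Protein("PROT_B", frozenset({"speckle"}), ("RS",)),
    Protein("PROT_C", frozenset({"speckle", "nucleolus"}), ("RS", "E")),
]
rows = lcr_rows(proteins)
print(lcr_types_for(rows, "speckle"))
```

      The speckle annotation is enriched for "RS" rows, yet the "Q" and "E" rows still appear under the same label; on a UMAP those would be the sparsely spread points.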

      We have now changed the wording in that section of the main text to reflect that the expectation is not all LCRs mapping to a certain region, but enrichment of certain LCR compositions.
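      Reporting an effect size alongside the p-value speaks directly to the reviewer's concern about large samples. The stdlib-only sketch below computes the Mann-Whitney U statistic (equivalent to the Wilcoxon rank-sum test) with average ranks for ties, the common-language effect size A = U/(n1*n2) (the probability that a random value from the first group exceeds one from the second, ties counting one half), and a two-sided normal-approximation p-value without tie correction. The data are simulated, two normal samples of 5,000 values differing by 0.2 SD, and are not amino acid frequencies from this study.

```python
import math
import random


def mann_whitney(x, y):
    """Return (U for x, common-language effect size A, two-sided normal-approx p)."""
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # average 1-based rank across a block of ties
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)  # no tie correction
    z = (u_x - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided P(|Z| >= |z|)
    return u_x, u_x / (n1 * n2), p


rng = random.Random(1)
shifted = [rng.gauss(0.2, 1.0) for _ in range(5000)]   # simulated, not real data
baseline = [rng.gauss(0.0, 1.0) for _ in range(5000)]
u, a, p = mann_whitney(shifted, baseline)
print(f"A = {a:.3f}, p = {p:.1e}")
```

      With 5,000 values per group, a 0.2 SD shift typically yields a p-value far below any conventional threshold while A stays near 0.56, only slightly better than a coin flip; scipy.stats.mannwhitneyu offers an exact and tie-corrected implementation for real analyses.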

      Reviewer #3 (Public Review):

      The authors present a systematic assessment of low-complexity regions (LCRs), applying the dotplot matrix method for sequence comparison to identify low-complexity regions based on per-residue similarity. By taking the resulting self-comparison matrices and leveraging tools from image processing, the authors define LCRs based on similarity or non-similarity to one another. Taking the composition of these LCRs, the authors then compare how distinct regions of LCR sequence space compare across different proteomes.

      The paper is well-written and easy to follow, and the results are consistent with prior work. The figures and data are presented in an extremely accessible way and the conclusions seem logical and sound.

      My big-picture concern is one that is perhaps challenging to evaluate: it is not really clear to me exactly what we learn here. The authors do a fine job of cataloging LCRs and offer a number of anecdotal inferences and observations. Perhaps this is sufficient in terms of novelty and interest, but if anyone takes a proteome and identifies sequences based on some set of features that sit in the tails of the feature distribution, they can similarly construct intriguing but somewhat speculative hypotheses regarding the possible origins or meaning of those features.

      The authors use the lysine-repeats as specific examples where they test a hypothesis, which is good, but the importance of lysine repeats in driving nucleolar localization is well established at this point - i.e. to me at least the bioinformatics analysis that precedes those results is unnecessary to have made the resulting prediction. Similarly, the authors find compositional biases in LCR proteins that are found in certain organelles, but those biases are also already established. These are not strictly criticisms, in that it's good that established patterns are found with this method, but I suppose my concern is that this is a lot of work that perhaps does not really push the needle particularly far.

      As an important caveat to this somewhat muted reception, I recognize that having worked on problems in this area for 10+ years I may also be displaying my own biases, and perhaps things that are "already established" warrant repeating with a new approach and a new light. As such, this particular criticism may well be one that can and should be ignored.

      We thank the reviewer for taking the time to read and give feedback for our manuscript. We respectfully disagree that our work does not push the needle particularly far.

      In the section titled ‘LCR copy number impacts protein function’, our goal is not to highlight the importance of lysines in nucleolar localization, but to provide a specific example of how studying LCR copy number, made possible by our approach, can provide specific biological insights. We first show that K-rich LCRs can mediate in vitro assembly. Moreover, we show that the copy number of K-rich LCRs is important for both higher order assembly in vitro and nucleolar localization in cells, which suggests that by mediating interactions, K-rich LCRs may contribute to the assembly of the nucleolus, and that this is related to nucleolar localization. The ability of our approach to relate previously unrelated roles of K-rich LCRs not only demonstrates the value of a unified view of LCRs but also opens the door to study LCR relationships in any context.

      Furthermore, our goal in identifying established biases in LCR composition for certain assemblies was to validate that the sequence space captures higher order assemblies which are known. In addition to known biases, we use our approach to uncover the roles of LCR biases that have not been explored (e.g. E-rich LCRs in nucleoli, see Figure 4 in revised manuscript), and discover new regions of LCR sequence space which have signatures of higher order assemblies (e.g. Teleost-specific T/H-rich LCRs). Collectively, our results show that a unified view of LCRs relates the disparate functions of LCRs.

      In response to these comments, we have added additional explanations at the end of several sections to clarify the impact of our findings in the scope of the broader field. Furthermore, as we note in our main response, we have added experimental data with new findings to address this concern.

      That overall concern notwithstanding, I had several other questions that sprung to mind.

      Dotplot matrix approach

      The authors do a fantastic job of explaining this, but I'm left wondering, if one used an algorithm like (say) SEG, defined LCRs, and then compared between LCRs based on composition, would we expect the results to be so different? i.e. the authors make a big deal about the dotplot matrix approach enabling comparison of LCR type, but, it's not clear to me that this is just because it combines a two-step operation into a one-step operation. It would be useful I think to perform a similar analysis as is done later on using SEG and ask if the same UMAP structure appears (and discuss if yes/no).

      Thank you for your thoughtful question about the differences between SEG and the dotplot matrix approach. We have tried our best to convey the advantages of the dotplot approach over SEG in the paper, but we did not focus on this for the following reasons:

      1) SEG and dotplot matrices are long-established approaches to assessing LCRs. We did not consider a comparison between them to be within the scope of our paper, since our main claim is that the approach as a whole (looking at LCR sequences, relationships, features, and functions) is what gives a broader understanding of LCRs across proteomes. The key benefits of dotplots, such as direct visual interpretation and the ability to distinguish LCR types and copy number within a protein, are conveyed in Figure 1A-C and Figure 1 - figure supplements 1 and 4. In fact, these benefits of dotplots were acknowledged in the early SEG papers, whose authors recommended using dotplots to gain a prior understanding of protein sequences of interest at a time when it was not yet computationally feasible to analyze dotplots on the same scale as SEG (Wootton and Federhen, Methods in Enzymology, vol. 266, 1996, pages 554-571). Thus, our focus is on the ability to use image processing tools to "convert" the intuition of dotplots into a precise read-out of LCRs and their relationships on a multi-proteome scale. That said, we have considered the differences between these methods, as described in our technical considerations in point 2 below.

      2) SEG identifies LCRs irrespective of their type, primarily because it was originally used to mask LCR-containing regions in proteins to facilitate studies of globular domains. As a result, SEG run with its recommended settings commonly fuses nearby LCRs and designates the entire region as "low complexity". For the original purpose of SEG, this is understandable: the very conservative approach ensures that the non-low complexity regions (i.e. putative folded domains) are well annotated. For the purpose of distinguishing LCR composition, however, it is not ideal, because it is not stringent in separating LCRs that are close together but different in composition. Fusion can be seen in the comparison of specific LCR calls for the collagen CO1A1 (Figure 1 - figure supplement 3E), where even intermediate-stringency SEG settings fuse LCR calls that the dotplot approach keeps separate. Finally, we also performed the downstream UMAP analysis with LCRs called by SEG, and found that although certain aspects of the dotplot-based LCR UMAP are reflected in the SEG-based LCR UMAP, the overall resolution is worse with default settings, likely because of fused LCRs of different compositions. Improving resolution with more stringent settings comes at the cost of reducing the number of LCRs assessed. We have attached this analysis to our rebuttal for the reviewer, but maintain that this comparison is not the focus of our manuscript. We do not make strong claims about dotplot matrices being better than SEG, or any other method, at calling LCRs.

      UMAPs generated from LCRs called by SEG
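      To make the point about fusion concrete, the following is a minimal Python sketch. It is neither our pipeline (which applies image processing to dotplots on a multi-proteome scale) nor SEG itself (which uses its own complexity measure, window length and trigger/extension thresholds). It contrasts an identity dotplot with a simplified sliding-window Shannon-entropy filter in the spirit of SEG. The function names, the toy sequence, the window length of 8 and the 1.5-bit threshold are all illustrative assumptions. For a toy protein in which a K-rich tract is immediately followed by an E-rich tract, the dotplot shows two dense diagonal blocks with an empty K-versus-E off-diagonal block, whereas the composition-blind entropy filter reports a single fused region.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple


def dotplot(seq: str) -> List[List[int]]:
    """Identity dotplot: cell (i, j) is 1 when residues i and j are the same amino acid."""
    n = len(seq)
    return [[int(seq[i] == seq[j]) for j in range(n)] for i in range(n)]


def block_density(mat: List[List[int]], i0: int, j0: int, size: int) -> float:
    """Fraction of identical-residue cells in the size x size block starting at (i0, j0)."""
    cells = [mat[i][j] for i in range(i0, i0 + size) for j in range(j0, j0 + size)]
    return sum(cells) / len(cells)


def shannon_entropy(window: str) -> float:
    """Shannon entropy (bits) of the residue composition of a window."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())


def low_entropy_regions(seq: str, window: int = 8, threshold: float = 1.5) -> List[Tuple[int, int]]:
    """Flag every window with entropy <= threshold and merge flagged residues into [start, end) intervals.

    The filter is composition-blind: adjacent tracts of different composition end up in one interval.
    """
    flagged = [False] * len(seq)
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) <= threshold:
            for k in range(i, i + window):
                flagged[k] = True
    regions: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, is_low in enumerate(flagged):
        if is_low and start is None:
            start = i
        elif not is_low and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(seq)))
    return regions


# Toy protein (hypothetical): 10 diverse residues, 8 K, 8 E, 10 diverse residues.
seq = "MSTVWYHPQR" + "K" * 8 + "E" * 8 + "DGNAILFCWT"
mat = dotplot(seq)

k_block = block_density(mat, 10, 10, 8)  # K tract vs itself: 1.0
e_block = block_density(mat, 18, 18, 8)  # E tract vs itself: 1.0
k_vs_e = block_density(mat, 10, 18, 8)   # K tract vs E tract: 0.0, so the dotplot keeps them apart
fused = low_entropy_regions(seq)         # [(8, 28)]: one region covering both tracts plus a few flanking residues
```

      A window straddling the two tracts (e.g. KKKKEEEE) has an entropy of only 1 bit, so no single composition-blind threshold can split them, while the dotplot separates them by construction.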

      LCRs from repeat expansions

      I did not see any discussion of the role that repeat expansions can play in defining LCRs. This seems like an important area to consider, especially if we expect certain LCRs to appear more frequently due to a combination of slippery codons and minimal impact from the biochemical properties of the resulting LCR. The authors pursue a (very reasonable) model in which LCRs are functional and important, but the alternative (that LCRs are simply an unavoidable product of large proteomes and emerge through genetic events that are insufficiently deleterious to be selected against) also seems plausible. Some discussion of this would be helpful. It also makes me wonder whether the authors' null proteome model is the "right" model, although I would also say that developing an accurate and reasonable null model that accounts for repeat expansions is beyond what I would consider the scope of this paper.

      While the role of repeat expansions in generating LCRs has been studied and discussed extensively in the LCR field, we decided to focus on which LCRs exist in the proteome and what their downstream functions may be. Our rationale is that, while one might not expect a functional LCR to arise from repeat expansion, this argument carries less weight when there is evidence that these LCRs are functional. For example, for many of these LCRs (e.g. a K-rich LCR, an R/S-rich LCR, etc. as in Figure 3), we know that the LCR is sufficient for integration into the higher order assembly. Moreover, more recent work showed that variation in the length of an LCR has functional consequences (Basu et al., Cell, 2020), suggesting that LCR emergence through repeat expansions does not imply lack of function. Therefore, while we think the origin of an LCR is an interesting question, whether or not that LCR was gained through repeat expansions falls outside the scope of this paper.

      With regard to repeat expansions as they pertain to our choice of null model, we reasoned that because the origin of an LCR is not necessarily coupled to its function, it would be more useful to retain LCR sequences even if they are more likely to occur given a background proteome composition. This way, instead of being discarded based on an assumption, LCRs can be evaluated for function through other approaches that do not assume that likelihood of occurrence is inversely related to function.

      While we maintain that the uniform background is the most appropriate, the questions from this and the other reviewers made us realize that a thorough explanation of this choice of null proteome was warranted. For a complete explanation of our choice of this uniform null model, please see the newly added appendix section, Appendix 1.

      The authors would also like to point out that the original SEG algorithm (Wootton and Federhen, 1993) also made the intentional choice of using a uniform background model.
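      As a minimal illustration of what the choice of background implies (a sketch under stated assumptions, not the procedure described in Appendix 1), the Python below generates random null proteomes either with a uniform background (each of the 20 standard residues at 1/20) or with supplied background frequencies, and computes the expected number of homopolymeric windows of a given length. The function names and the 10% lysine frequency are hypothetical. Under a composition-matched background, a run of L lysines becomes (p_K / 0.05)^L times more expected than under the uniform background, so a filter based on such a null would preferentially discard LCRs built from abundant residues; the uniform null does not tie retention to background composition.

```python
import random
from typing import Dict, List, Optional

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def null_proteome(lengths: List[int], freqs: Optional[Dict[str, float]] = None, seed: int = 0) -> List[str]:
    """Random sequences with the given lengths.

    freqs=None draws each residue uniformly (1/20); otherwise residues are drawn with the
    supplied background frequencies (hypothetical values in this sketch).
    """
    rng = random.Random(seed)
    weights = None if freqs is None else [freqs.get(aa, 0.0) for aa in AMINO_ACIDS]
    return ["".join(rng.choices(AMINO_ACIDS, weights=weights, k=n)) for n in lengths]


def expected_homopolymer_windows(total_length: int, run_len: int, p: float) -> float:
    """Expected number of overlapping run_len-long windows made only of one residue of
    frequency p, in a single random sequence of total_length residues."""
    return (total_length - run_len + 1) * p ** run_len


uniform_null = null_proteome([50, 120], seed=1)         # uniform background
k_only = null_proteome([30], freqs={"K": 1.0}, seed=1)  # degenerate background, for illustration only

uniform = expected_homopolymer_windows(1000, 5, 1 / 20)
k_rich = expected_homopolymer_windows(1000, 5, 0.10)    # hypothetical 10% lysine background
ratio = k_rich / uniform                                # (0.10 / 0.05) ** 5, i.e. about 32
```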

      Minor points

      Early on the authors discuss the roles of LCRs in higher-order assemblies. They then refer to the lysine tracts as having a valence of 2 or 3. It may be useful to mention that valence reflects the number of partners that a protein can interact with simultaneously. While it is certainly possible that a single lysine tract interacts with a single partner at a time (meaning the tract contributes a valence of 1), I don't think the authors can know that, so it may be wise to avoid specifying the valence.

      Thank you for pointing this out. We agree with the reviewer's interpretation, have removed our initial valence interpretation from the text, and now simply state that a copy number of at least two is required for RPA43’s integration into the nucleolus.

      The authors make reference to Q/H LCRs. Recent work from Gutiérrez et al. eLife (2022) has argued that histidine-richness in some glutamine-rich LCRs is above the number expected based on codon bias, and may reflect a mode of pH sensing. This may be worth discussing.

      We appreciate the reviewer pointing out this publication. While this manuscript had not yet been published when we wrote our paper, upon reading it we agree that it has some very relevant findings. We have added a reference to it in our discussion of Q/H-rich LCRs.

      Eric Ross has a number of very nice papers on this topic, but sadly I don't think any of them are cited here. On the question of LCR composition and condensate recruitment, I would recommend Boncella et al. PNAS (2020). On the question of proteome-wide LCR analysis, see Cascarina et al PLoS CompBio (2018) and Cascarina et al PLoS CompBio 2020.

      We appreciate the reviewer for noting this related body of work. We have updated the citations to include work from Eric Ross where relevant.

    1. Author Response

      Reviewer #1 (Public Review):

      This study examines the factors underlying the assembly of MreB, an actin family member involved in mediating longitudinal cell wall synthesis in rod-shaped bacteria. MreB is required for maintaining rod shape and is essential for growth in model bacteria, and single molecule work indicates that it forms treadmilling polymers that guide the synthesis of new peptidoglycan along the longitudinal cell wall. MreB has proven difficult to work with and the field is littered with artifacts. In vitro analysis of MreB assembly dynamics has not fared much better, as helpfully detailed in the introduction to this study. In contrast to its distant relative actin, MreB is difficult to purify and requires very specific conditions to polymerize that differ between groups of bacteria. Currently, in vitro analysis of MreB and related proteins has been mostly limited to MreBs from Gram-negative bacteria, which have different properties and behaviors from related proteins in Gram-positive organisms.

      Here, Mao and colleagues use a range of techniques to purify MreB from the Gram-positive organism Geobacillus stearothermophilus, identify factors required for its assembly, and analyze the structure of MreB polymers. Notably, they identify two short hydrophobic sequences, located near one another in the 3-D structure, that are required to mediate membrane anchoring.

      With regard to assembly dynamics, the authors find that Geobacillus MreB assembly requires both interactions with membrane lipids and nucleotide binding. Nucleotide hydrolysis is required for interaction with the membrane, and interaction with lipids triggers polymerization. These experiments appear to have been conducted in a rigorous manner, although the salt concentration of the buffer (500 mM KCl) is quite high relative to that used for in vitro analysis of MreBs from other organisms. The authors should elaborate on their decision to use such a high-salt buffer and, ideally, provide insight into how it might impact their findings relative to previous work.

      Response 1.1. MreB proteins are notoriously difficult to maintain in a soluble form. Some labs deleted the N-terminal amphipathic or hydrophobic sequences to increase solubility, while others used the full-length protein with a high KCl concentration (300 mM KCl) (Harne et al, 2020; Pande et al., 2022; Popp et al, 2010; Szatmari et al, 2020). Early in the project, we tested many conditions and noticed that high KCl helped maintain slightly better solubility of full-length MreBGs without the need to delete part of the protein. In addition, salt concentrations > 100 mM better mimic the conditions the protein encounters in vivo. While 50-100 mM KCl is traditionally used in actin polymerization assays, physiological salt concentrations are around 100-150 mM KCl in invertebrates and vertebrates (Schmidt-Nielsen, 1975), around 50-250 mM in fungal and plant cells (Rodriguez-Navarro, 2000) and 200-300 mM in budding yeast (Arino et al, 2010). However, cytoplasmic K+ concentration varies greatly (up to 800 mM) depending on the osmolality of the medium in both E. coli (Cayley et al, 1991; Epstein & Schultz, 1965; Rhoads et al, 1976) and B. subtilis, in which the basal intracellular KCl concentration was estimated to be ~350 mM (Eisenstadt, 1972; Whatmore et al, 1990). 500 mM KCl can therefore be considered as physiological as 100 mM KCl for bacterial cells. Since we observed plenty of pairs of protofilaments at 500 mM KCl and this condition helped avoid aggregation, we kept this high concentration as the standard for most of our experiments. Nonetheless, we had also performed TEM polymerization assays at 100 mM KCl, in line with most of the MreB and F-actin in vitro literature, and found no difference in the conditions that supported (or failed to support) polymerization. This was indicated in the initial submission (e.g. M&M section L540 and footnote of Table S2), but since two reviewers raised it as a main point, we evidently failed to communicate it clearly, for which we apologize. This has been clarified in the revised version of the manuscript. As requested by reviewer #2, we have also repeated almost all experiments at 100 mM KCl, to reconcile our salt conditions with those used in some in vitro analyses of MreBs from other organisms (see also the responses to reviewer #2 comments 1A and 1B = Responses 2.1A, 2.1B below). We therefore refer to 100 mM KCl as our “standard condition” in the revised version of the manuscript, while also compiling and comparing the results obtained at 500 mM, as both concentrations are within the physiological range in Bacillus.

      Additionally, this study, like many others on MreB, makes much of MreB's relationship to actin. This leads to confusion and the use of unhelpful comparisons. For example, MreB filaments are not actin-like (line 58) any more than any polymer is "actin-like." As evidenced by the very beautiful images in this manuscript, MreB forms straight protofilaments that assemble into parallel arrays, not the paired, twisted polymers that are characteristic of F-actin. Generally, I would argue that work on MreB has been hindered by, rather than benefitted from, its relationship to actin (e.g. early FP fusion data interpreted as evidence for an MreB endoskeleton supporting cell shape, or depletion experiments implicating MreB in chromosome segregation), and thus such comparisons should be avoided unless absolutely necessary.

      Response 1.2. We completely agree with reviewer #1 regarding unhelpful comparisons between actin and MreB, and that work on MreB has traditionally been hindered by its relationship to eukaryotic actin. MreB is nonetheless a structural homolog of actin, with a similar fold and common properties (polymerization into pairs of protofilaments, ATPase activity…). It therefore still makes sense to refer to a widely studied protein with common features and common ancestry, as long as we do not confine our thinking to its conceptual framework. That said, actin and MreB diverged very early in evolution, which may account for differences in their biochemical properties and cellular functions. Current data on MreB filaments confirm that they display both F-actin-like and F-actin-unlike properties. We thank the reviewer for this insightful comment. We have revised the text to remove any inaccurate or unhelpful comparison to actin (in particular the ‘actin-like filaments’ statement, previously used once).

      Reviewer #2 (Public Review):

      The paper "Polymerization cycle of actin homolog MreB from a Gram-positive bacterium" by Mao et al. provides the second biochemical study of a Gram-positive MreB but, importantly, the first study to examine how Gram-positive MreB filaments bind to membranes. They also present the first crystal structures of an MreB from a Gram-positive bacterium, in two nucleotide-bound forms, finally solving structures that have been missing for too long. They also elucidate which residues in Geobacillus MreB are required for membrane association. Finally, the QCM-D approach to monitoring MreB membrane association is a direct and elegant assay.

      While the above findings are novel and important, this paper also draws a series of conclusions that run counter to multiple in vitro studies of MreBs from different organisms and of other polymers with the actin fold. Overall, they propose that Geobacillus MreB has biochemical properties quite different not only from the other MreBs examined so far but also from eukaryotic actin and every actin homolog that has been characterized in vitro. As the conclusions proposed here would make Geobacillus MreB the sole exception among actin-fold polymers, further supporting experiments are needed to bolster these contrasting conclusions and the overall model.

      Response 2.0. We are grateful to reviewer #2 for stressing the novelty and importance of our results. Most of our conclusions were in line with previous in vitro studies of MreBs (formation of pairs of straight filaments on a lipid layer, both ATP and GTP binding and hydrolysis, distortion of liposomes…), with the exception of the claimed requirement of NTP hydrolysis for membrane binding prior to polymerization, which was based on the absence of pairs of filaments in free solution or in the presence of AMP-PNP in our experimental conditions (we agree this was not sufficient to support such a bold claim, see below). Thanks to the reviewer’s comments, we have performed many controls and additional experiments that led us to refine our results and largely reconcile them with the literature. Please see the answer to the global review comments; our conclusions have been revised on the basis of our new data.

      1. (Difference 1) - The predominant concern about the in vitro studies, which makes it difficult to evaluate many of their results (much less compare them to other MreBs and actin homologs), is the use of a highly unconventional polymerization buffer containing 500(!) mM KCl. As has been demonstrated with actin and other polymers, the high KCl concentration used here (500 mM) is certain to affect the polymerization equilibria, as increasing salt increases the hydrophobic effect and inhibits salt bridges, and will therefore affect the affinity between monomers and filaments. For example, past work has shown that high salt greatly changes actin polymerization, causing a decreased critical concentration, increased bundling, and greatly increased filament stiffness (Kang et al., 2013, 2012). Similarly, with AlfA, increased salt concentrations have been shown to increase the critical concentration, decrease the polymerization kinetics, and inhibit the bundling of AlfA filaments (Polka et al., 2009).

      A more closely related example comes from the previous observation that increasing salt concentrations increasingly slow the polymerization kinetics of B. subtilis MreB (Mayer and Amann, 2009). Lastly, these high salt concentrations might also change the interactions of MreB(Gs) with the membrane by screening charges and/or increasing the hydrophobic effect. Given that 500 mM KCl was used throughout this paper, many (if not all) of the key experiments should be repeated at a more standard salt concentration (~100 mM), similar to those used in most previous in vitro studies of polymers.
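      For a rough sense of the scale of the charge-screening effect mentioned above (a textbook estimate under ideal-solution assumptions, not a measurement from either study): for a 1:1 electrolyte such as KCl in water at 25 °C, where the ionic strength I equals the salt concentration, the Debye screening length is approximately

$$
\lambda_D \approx \frac{0.304\ \mathrm{nm}}{\sqrt{I/\mathrm{M}}}, \qquad
\lambda_D(0.1\ \mathrm{M}) \approx 0.96\ \mathrm{nm}, \qquad
\lambda_D(0.5\ \mathrm{M}) \approx 0.43\ \mathrm{nm}
$$

      so raising KCl from 100 to 500 mM shortens the range of electrostatic interactions by a factor of about √5 ≈ 2.2. This estimate ignores the contribution of the other buffer components (e.g. MgCl2) to the ionic strength.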

      Response 2.1A. As requested by reviewer #2, we have repeated at 100 mM KCl most of the experiments (TEM, cryo-EM, QCM-D and ATPase assays) initially performed only at 500 mM KCl. The KCl concentration affects both membrane binding and filament stiffness, as anticipated by the reviewer, but the main conclusions are unchanged. The revised version of the manuscript compiles and compares the results obtained at both high and low [KCl], both concentrations being within the physiological range in Bacillus. Please see point 1 of the response to the global review comments and the first response to reviewer 1 (Response 1.1) for further elaboration.

      Please note that in Mayer & Amann, 2009 (B. subtilis MreB), light scattering in free solution was inversely proportional to the KCl concentration, with the highest light scattering signal at 0 mM KCl (!), a > 2-fold reduction below 30 mM KCl and no scattering at all at 250 mM, suggesting a “salting in” phenomenon (see also the answers to “Other Points to address” 1A and 2, below) (Mayer & Amann, 2009). Since effective polymer formation (e.g. polymers shown by EM) was not demonstrated in these experiments, it cannot be excluded that KCl was simply preventing aggregation of B. subtilis MreB in solution, as we observe. For all their other light scattering experiments, the ‘standard polymerization condition’ used by Mayer & Amann was 0.2 mM ATP, 5 mM MgCl2, 1 mM EGTA and 10 mM imidazole pH 7.0, to which MreB (in 5 mM Tris pH 8.0) was added. No KCl was present in their ‘standard’ polymerization conditions.

      This would test whether the many divergent properties of MreB(Gs) reported here arise from some difference between MreB(Gs) and other MreBs (and actin homologs), or from the 400 mM difference in salt concentration between the studies. Critically, it would also allow direct comparisons with previous studies of MreB (and other actin homologs) that used much lower salt, thereby allowing the authors to definitively demonstrate whether MreB(Gs) is indeed an outlier relative to other MreB and actin homologs. I would suggest using 100 mM KCl, as historically, all polymerization assays of actin and numerous actin homologs have used 50-100 mM KCl: 50 mM KCl (for actin in F buffer) or 100 mM KCl for multiple prokaryotic actin homologs and MreB (Deng et al., 2016; Ent et al., 2014; Esue et al., 2006, 2005; Garner et al., 2004; Polka et al., 2009; Rivera et al., 2011; Salje et al., 2011). Likewise, similar salt concentrations are standard for tubulin (80 mM K-Pipes) and FtsZ (100 mM KCl or 100 mM KAc in HMK100 buffer).

      Response 2.1B. We appreciate the reviewer’s feedback on this point. Please note that, although actin polymerization assays are historically performed at 50-100 mM KCl and 100 mM KCl was accordingly used for other bacterial actin homologs (MamK, ParM and AlfA), MreB polymerization assays have previously been reported both at 300 mM KCl (Harne et al., 2020; Pande et al., 2022; Popp et al., 2010; Szatmari et al., 2020), which is closer to the physiological salt concentration in bacterial cells (see Response 1.1), and in the absence of KCl (see above). In fact, we originally wanted to use a “standard polymerization condition” based on the MreB literature, before realizing there was none: only half of the studies used KCl (the other half used NaCl, or no monovalent salt at all), and among these, KCl concentrations varied (out of 8 publications, 2 used 20 mM KCl, 2 used 50 mM KCl and 4 used 300 mM KCl).

      2. (Difference 2) - One of the most important differences claimed in this paper is that MreB(Gs) filaments are straight, a result that runs counter to the curved T. maritima and C. crescentus filaments detailed by the Löwe group (Ent et al., 2014; Salje et al., 2011). Importantly, this difference could also arise from the difference in salt concentrations used in each study (500 mM here vs. 100 mM in the Löwe studies), and thus one cannot currently draw any direct comparisons between the two studies.

      One example of how high salt could cause differences in filament geometry: high salt is known to greatly increase the bending stiffness of actin filaments, making them more rigid (Kang et al., 2013). Likewise, increasing salt is known to change the rigidity of membranes. As the ability of filaments to A) bend the membrane or B) deform to the membrane depends on the stiffness of the filaments relative to that of the membrane, the observed difference in the "straight vs. curved" conformation of MreB filaments might simply arise from different salt concentrations. Thus, in order to draw direct comparisons between their findings and those for other MreB orthologs (as done here), the studies of MreB(Gs) conformations on lipids should be repeated in the same buffer conditions as used in the Löwe papers.

      Response 2.2. We fully agree with reviewer #2 that the salt concentration could be affecting the assay, and performed cryo-EM experiments in the presence of 100 mM KCl as requested. The results unambiguously showed countless liposomes curved at the areas of contact with MreB (Fig. 2F-G and Fig. 2-S5), very similar to what was reported for Thermotoga and Caulobacter MreBs by the Löwe group. Our results therefore confirm the previous findings that MreBs can bend lipids, and suggest that high salt may indeed increase filament stiffness, as has been shown for actin filaments. We are very grateful to reviewer #2 for this suggestion and for drawing our attention to the work of Kang et al., 2013. The different bending observed when varying the salt concentration raises relevant questions regarding the in vivo behavior of MreB, since intracellular KCl was shown to vary greatly depending on the medium composition. The manuscript has been updated accordingly in the Results (from L243) and Discussion sections (L585-595).

      3. (Difference 3) - The next important difference between MreB(Gs) and other MreBs is the claim that MreB polymers do not form in the absence of membranes.

      A) This is surprising relative to other MreBs, as 1) MreBs from T. maritima (multiple studies), E. coli (Nurse and Marians, 2013), and C. crescentus (Ent et al., 2014) have been shown to form polymers in solution (without lipids) by electron microscopy, light scattering, and time-resolved multi-angle light scattering. Notably, the Esue work was able to observe a first phase of polymer formation and a subsequent phase of polymer bundling of MreB in solution (Esue et al., 2006). 2) Similarly, Mayer and Amann (2009) demonstrated that B. subtilis MreB forms polymers in the absence of membranes using light scattering.

      Response 2.3A. The literature does convincingly show that Thermotoga MreB forms polymers in solution, without lipids (note that for Caulobacter MreB, filaments were only reported in the presence of lipids (van den Ent et al, 2014)). Assemblies reported in solution are bundles or sheets (including at the earlier time points of the time-resolved EM experiments reported by Esue et al. 2006 and mentioned by the reviewer: ‘2 minutes after adding ATP, EM revealed that MreB formed short filamentous bundles’) (Esue et al, 2006). However, as discussed above (Response 2.1A), the light scattering experiments in Mayer & Amann, 2009 do not conclusively demonstrate the presence of polymers of B. subtilis MreB in solution (Mayer & Amann, 2009). We performed many light scattering experiments with B. subtilis MreB in solution in the past (before finding out that filaments only formed in the presence of lipids), and obtained similar scattering curves (see two examples of DLS experiments in Author response image 1) in conditions in which NO polymers could ever be observed by EM while plenty of aggregates were present.

      Author response image 1.

      We did not consider these results publishable in the absence of true polymers observed by TEM. As pointed out in the interesting study from Nurse et al. on E. coli MreB (Nurse & Marians, 2013), one cannot rely on light scattering alone, because non-specific aggregates show patterns similar to those of polymers. Over the last two decades, about 15 publications showed polymers of MreB from several Gram-negative species, while none (despite the efforts of many) showed a single convincing MreB polymer from a Gram-positive bacterium by EM. A simple hypothesis is that a critical parameter was missing, and we present convincing evidence that lipids are critical for Geobacillus MreB to form pairs of filaments in the conditions tested. However, in solution too we do occasionally see pairs of filaments (Fig 2-S2), as well as sheet-like structures among aggregates when the MreB concentration is increased (Fig. 2-S2 and Fig. 3-S2). Thus, we agree with the reviewer that it cannot be claimed that Geobacillus MreB is unable to polymerize in the absence of lipids, but rather that lipids strongly stimulate its polymerization, depending on the conditions.

      B) The results shown in figure 5A also go against this conclusion, as there is only a 2-fold increase in the phosphate release from MreB(Gs) in the presence of membranes relative to the absence of membranes. Thus, if their model is correct, and MreB(Gs) polymers form only on membranes, this would require the unpolymerized MreB monomers to hydrolyze ATP at 1/2 the rate of MreB in filaments. This high relative rate of hydrolysis of monomers compared to filaments is unprecedented. For all polymers examined so far, the rate of monomer hydrolysis is several orders of magnitude less than that of the filament. For example, actin monomers are known to hydrolyze ATP 430,000X slower than the monomers inside filaments (Blanchoin and Pollard, 2002; Rould et al., 2006).

      Response 2.3B. We agree with the reviewer. We have now found conditions where sheets of MreB form in solution (at high MreB concentration) in the presence of ADP and AMP-PNP. However, we have also added several controls that exclude efficient formation of polymers in solution in the presence of ATP at low concentrations of MreBGs (≤ 1.5 µM), the condition used for the malachite green assays. At these MreB concentrations, pairs of filaments are observed in the presence of lipids but very infrequently in solution, and sheets are not observed in solution either (Fig. 2-S2A, B). Yet, puzzlingly, Pi release is reproducibly observed in solution in these conditions, reduced only ~2- to 3-fold relative to Pi release in the presence of lipids (Fig. 5A and Fig. 5-S1). A reinforcing observation comes from ATPase assays performed at 100 mM KCl (Fig. 5A). In this condition, MreB binding to lipids is increased relative to 500 mM KCl (Fig. 4-S4C), and the stimulation of ATPase activity by lipids is also stronger than at 500 mM (Fig. 5-S1A). Further work is needed to characterize in detail the ATPase activity of MreB proteins, for which data in the literature are very scarce. We cannot exclude that MreB could nucleate in solution, or form very unstable filaments that cannot be seen in our EM assay but consume ATP in the process. At the moment, the significance of the Pi released in solution is unknown and will require further investigation.

      C) Thus, there is a strong possibility that MreB(Gs) polymers are indeed forming in solution in addition to those on the membrane, and that these "solution polymers" are not captured by the electron microscopy assay. For example, high salt could be interfering with the adsorption of filaments to glow-discharged grids lacking lipids.

      Response 2.3C. We appreciate the reviewer’s insight on this critical point. Polymers presented in the original Fig. 2A were obtained at 500 mM KCl, but we had also tested the polymerization of MreB at 100 mM KCl without noticing differences. We have nonetheless repeated this quantitatively and used these data for the revised Fig. 2A, as we now use 100 mM KCl as our standard polymerization condition throughout the revised manuscript. We also followed the reviewer's other suggestion and tested glow-discharged grids (a more classic preparation for soluble proteins) vs non-glow-discharged EM grids, as well as a higher concentration of MreB. Grids are generally glow-discharged to make them hydrophilic so that they adsorb soluble proteins, but the properties of MreB (soluble but obviously presenting hydrophobic domains) made it difficult to predict which support putative soluble polymers would preferentially interact with. Septins, for example, bind much better to hydrophobic grids despite being soluble (I. Adriaans, personal communication). Virtually no double filaments were observed in solution at either low or high [MreB]. The fact that in some conditions (high [MreB], other nucleotides) we were able to detect sheet-like structures rules out a technical issue that would prevent the detection of existing but “invisible” polymers. We have added these new data in Fig. 2-S2.

      As indicated above, the reviewer’s comments made us realize that we could not state or imply that MreB cannot polymerize in the absence of lipids. In fact, we always saw some random filaments in the EM fields, both in solution and in the presence of non-hydrolysable analogues, at very low frequency (Fig. 2A). And we now see sheets at high MreB concentration (Fig. 2-S2B). We may simply be missing the optimal conditions for polymerization in solution, while our phrasing gave the impression that no polymers could ever form in the absence of ATP or lipids. Therefore, we have:

      1) analyzed all TEM data to present them as semi-quantitative TEM, using the methodology we originally implemented for the analysis of the mutants

      2) reworked the text to remove any overstated statements and to indicate that MreBGs was found to bind to a lipid monolayer as a double protofilament only in the presence of ATP/GTP, but that this does not exclude that filaments may also form in other conditions.

      In order to definitively prove that MreB(Gs) does not form polymers in solution, the authors should:

      i) conduct orthogonal experiments to test for polymers in solution. The simplest test of polymerization might be pelleting assays of MreB(Gs) with and without lipids, sweeping through the concentration range as done in Figs. 2B and 5A.

      Response 2.3Ci. Following reviewer #2's suggestion, we conducted a series of sedimentation assays in the presence and absence of lipids, at low (100 mM) and high (500 mM) salt, for both the wild-type protein and the three membrane-anchoring mutants (all at 1.3 µM). Sedimentation experiments in salt conditions that prevent aggregation in solution (500 mM KCl) were consistent with our TEM results: wild-type MreB pelleting increased in the presence of both ATP and lipids (Fig. R1). Sedimentation increased further at 100 mM KCl, which would be consistent with our other results indicating an increased interaction of MreB with the membrane. However, in addition to being poorly reproducible (in our hands), the approach does not discriminate between polymers and aggregates (or monomers bound to liposomes), and since MreB has a strong tendency to aggregate, we believe that the technique is ill-suited to reliably address MreB polymerization; we therefore prefer not to include sedimentation data in our manuscript. The recent work of Pande et al. (2022) illustrates this issue well, since no sedimentation of MreB (at 2 µM) was observed in solution in conditions supporting polymerization (at 300 mM KCl): ‘the protein does not pellet on its own in the absence of liposome, irrespective of its polymerization state’, implying that sedimentation does not allow detection of MreB5 filaments in solution (Pande et al., 2022).

      ii) They also could examine if they see MreB filaments in the absence of lipids at 100mM salt (as was seen in both Löwe studies), as the high salt used here might block the charges on glow discharged grids, making it difficult for the polymer to adhere.

      See above, Response 2.3C

      iii) Likewise, the claim, based on MreB mutants lacking the amino-terminus and the α2β7 hydrophobic loop, that this region "is required for polymerization" is questionable: if deleting these residues blocks membrane binding, the lack of polymers on the membrane on the grid is not unexpected, as filaments that cannot bind the membrane would not be observable. Given that these mutants cannot bind the membrane, mutant polymers could still exist in solution, and thus pelleting assays should be used to test whether non-membrane-associated filaments composed of these mutants exist.

      Response 2.3Ciii. This is a fair point, and we thank the reviewer for this remark. We did not mean to state or imply that the hydrophobic loop was required for polymerization per se, but rather that polymerization into double filaments only occurs efficiently upon membrane binding, which is mediated by the two hydrophobic sequences. We tested all three mutants by sedimentation as suggested by reviewer #2. In the salt condition that limits aggregation (500 mM KCl), the mutants did not pellet while the wild-type protein did (in the presence of lipids) (Fig. R2 below), in agreement with our EM data. We also tested the mutant bearing both deletions in the absence of lipids and found that its (partial) sedimentation at low KCl concentration was ATP- and lipid-dependent (Fig. R3).

      Given our concerns about MreB sedimentation assays (see above, Response 2.3Ci), we prefer not to include these sedimentation data in our manuscript. Instead, we tested by TEM the possible polymerization of the mutants in solution (we only tested them in the presence of lipids in the initial submission). No filaments were detected in solution for any of the mutants (Fig. 4-S3A).

      A final note: the results shown in "Figure 1 - figure supplement 2, panel C" appear to directly refute the claim that MreB(Gs) requires lipids to polymerize. As currently written, it appears they can observe MreB(Gs) filaments on EM grids without lipids. If these experiments were done in the presence of lipids, the figure legend should be updated to indicate that. If these experiments were done in the absence of lipids, the claim that membrane association is required for MreB polymerization should be revised.

      The TEM experiments shown were indeed performed in the presence of lipids. We apologize that this was not clearly stated in the legend. To avoid any confusion, we have nevertheless removed these images from this figure, since the polymerization conditions and lipid requirement have not yet been presented when this figure is referred to in the text. Instead, we have added a panel with the calibration curve for the size-exclusion profiles, as requested by reviewer #3. The main point of this figure is to show the tendency of MreBGs to aggregate: analytical size-exclusion chromatography shows a single peak corresponding to monomeric MreBGs (molecular weight ~37 kDa) in our purification conditions, but this can readily shift to a peak corresponding to high-MW aggregates, depending on protein concentration and/or storage conditions.

      1. (Difference 4) - The next difference between this study and previous studies of MreB and actin homologs is the conclusion that MreB(Gs) must hydrolyze ATP in order to polymerize. This conclusion is surprising, given that both T. maritima MreB (Salje 2011, Bean 2008) and B. subtilis MreB (Mayer 2009) have been shown to polymerize in the presence of ATP as well as AMP-PNP.

      Likewise, MreB polymerization has been shown to lag ATP hydrolysis not only in T. maritima MreB (Esue 2005) but also in eukaryotic actin and all other prokaryotic actin homologs whose polymerization and phosphate release have been directly compared: MamK (Deng et al., 2016), AlfA (Polka et al., 2009), and two divergent ParM homologs (Garner et al., 2004; Rivera et al., 2011). Currently, the only evidence supporting the idea that MreB(Gs) must hydrolyze ATP in order to polymerize comes from two observations: 1) using electron microscopy, they cannot see filaments of MreB(Gs) on membranes in the presence of AMP-PNP or ApCpp, and 2) no appreciable signal increase appears when testing AMPPNP-MreB(Gs) by QCM-D. This evidence is by no means conclusive enough to support this bold claim: while their competition experiment does indicate that AMPPNP binds to MreB(Gs), it is possible that MreB(Gs) simply cannot polymerize when bound to AMPPNP.

      For example, it has been shown that different actin homologs respond differently to different non-hydrolysable analogs: some, like actin, can hydrolyze one ATP analog but not another, while others are able to bind many different ATP analogs but polymerize with only some of them.

      Response 2.4. We agree with the reviewer: it is uncertain which analogs bind, because they differ substantially from ATP and some proteins simply do not accommodate them; analogs can also change conditions such that filaments stop forming, which can (theoretically) be misleading. This is why we had tested ApCpp in addition to AMP-PNP as a non-hydrolysable analog (Fig. 3A). As indicated above, our new complementary experiments (Fig. 3-S1B-D) now show that rare dual polymers (i.e. infrequent and in limited amounts) are detected in the presence of ApCpp (Fig. 3A) and, in the presence of AMP-PNP, only at high MreB concentration (Fig. 3-S1B-D), suggesting different critical concentrations in the presence of alternative nucleotides. We have toned down our conclusions in light of these new data and modified the Discussion accordingly.

      Thus, to further verify their "hydrolysis is needed for polymerization" conclusion, they should:

      A. Test if a hydrolysis deficient MreB(Gs) mutant (such as D158A) is also unable to polymerize by EM.

      Response 2.4A. We thank the reviewer for this suggestion. As this conclusion has been revised on the basis of our new data (see previous response), testing putative ATPase-deficient mutants is no longer required here. The study of ATPase mutants is planned for future work (see Response 3.10 to reviewer #3).

      B. They should also conduct an orthogonal assay of MreB polymerization aside from EM (pelleting assays might be the easiest). They should test whether polymers of ATP-bound MreB(Gs), AMP-PNP-bound MreB(Gs), and MreB(Gs)(D158A) form in solution (without membranes) by conducting pelleting assays. These could also be conducted with and without lipids, thereby also addressing the points noted above in point 3.

      Response 2.4B. Please see Response 2.3Ci above.

      C. Polymers may indeed form with ATP-gamma-S, and this non-hydrolysable ATP analog should be tested.

      Response 2.4C. It is quite possible that ATP-γ-S supports polymerization, since it is known to be partially hydrolysable by actin, giving a mild phenotype (Mannherz et al., 1975). This molecule can even be a bona fide substrate for some ATPases (e.g. Peck & Herschlag, 2003). We therefore decided to exclude this “non-hydrolysable” analog and instead tested AMP-PNP and ApCpp. We know that ATP-γ-S has been, and still is, frequently used, but we preferred to avoid it for now for the reasons indicated above. We chose AMPPNP and AMPPCP instead because (1) they were shown to be completely non-hydrolysable by actin, in contrast to ATP-γ-S; (2) they are widely used (the most commonly used for structural studies; Lacabanne et al., 2020); and (3) AMPPNP was previously used in several publications on MreB (Bean & Amann, 2008; Nurse & Marians, 2013; Pande et al., 2022; Popp et al., 2010; Salje et al., 2011; van den Ent et al., 2014) and would thus allow direct comparison. AMPPCP was added to confirm the findings with AMP-PNP. There are many other analogs that we plan to explore in future studies (see next Response, 2.4D).

      D. They could also test how the ADP-Phosphate bound MreB(Gs) polymerizes in bulk and on membranes, using beryllium phosphate to trap MreB in the ADP-Pi state. This might allow them to further refine their model.

      Response 2.4D. We plan to address the question of the transition state in depth in follow-up work, using a series of analogs and mutants presumably affected in ATPase activity, both predicted and identified in a genetic screen. As indicated above, it is uncertain which analogs bind, because they differ substantially from ATP, and some may bind but prevent filament formation. We therefore anticipate that testing a single analog may not be sufficient: analogs can change conditions and be (theoretically) misleading, so a thorough analysis is needed to address this question. Since our model and conclusions have been revised on the basis of our new data, we believe that these experiments are beyond the scope of the current manuscript.

      E. Importantly, the Mayer study of B. subtilis MreB found the same results with regard to nucleotides: "In polymerization buffer, MreB produced phosphate in the presence of ATP and GTP, but not in ADP, AMP, GDP or AMP-PNP, or without the readdition of any nucleotide". Thus, this paper should be referenced and discussed.

      Response 2.4E. We agree that Pi release was detected previously. We have added the reference (L121)

      1. (Difference 5) - The introduction states (lines 128-130) "However, the need for nucleotide binding and hydrolysis in polymerization remains unclear due to conflicting results, in vivo and in vitro, including the ability of MreB to polymerize or not in the presence of ADP or the non-hydrolysable ATP analog AMP-PNP."

      A) While this is a great way to introduce the problem, the statement is a bit vague and should be clarified, detailing the conflicting results with the appropriate references. For example, what conflicting in vivo results are they referring to? Regarding "MreB polymerization in AMP-PNP", multiple groups have shown the polymerization of MreB(Tm) in the presence of AMP-PNP, but it is not clear which papers found opposing results.

      Response 2.5A. Thanks for the comment. We originally did not detail these ‘conflicting results’ in the Introduction because we did so later in the text, with the appropriate references, in particular in the Discussion (former L433-442). We have now removed this from the Discussion and instead added a sentence to the Introduction (L123-130) briefly detailing the discrepancies and giving the references.

      • For clarity, we have removed “in vivo” (which referred to the distinct results reported for the presumed ATPase mutants by the Garner and Graumann groups) and now focus on the in vitro discrepancies only.

      • These discrepancies are the following: while some studies did show polymerization (as assessed by EM) of MreBTm in the presence of AMPPNP, the studies of Popp et al. and Esue et al. on T. maritima MreB, and of Nurse et al. on E. coli MreB, reported aggregation in the presence of AMP-PNP (Esue et al., 2006; Popp et al., 2010) or ADP (Nurse & Marians, 2013), or no assembly in the presence of ADP (Esue et al., 2006). As for the studies reporting polymerization in the presence of AMP-PNP by light scattering only (Bean & Amann, 2008; Gaballah et al., 2011; Mayer & Amann, 2009; Nurse & Marians, 2013), they could not differentiate between aggregates and true polymers and thus cannot be considered conclusive.

      B) The statement "However, the need for nucleotide binding and hydrolysis in polymerization remains unclear due to conflicting results, in vivo and in vitro, including the ability of MreB to polymerize or not in the presence of ADP or the non-hydrolyzable ATP analog AMP-PNP" is technically incorrect and should be rephrased or further tested.

      i. For all actin (or tubulin) family proteins, it is not that a given protein "cannot polymerize" in the presence of ADP, but rather that the ADP-bound form has a higher critical concentration for polymer formation than the ATP-bound form. This means that ADP-bound monomers can indeed polymerize, but only when the total protein concentration exceeds the ADP critical concentration. For example, many actin-family proteins do polymerize in ADP: ADP-actin has a 10-fold higher critical concentration than ATP-actin (Pollard, 1984), and the ADP critical concentrations of AlfA and ParM are 5-fold and 50-fold higher, respectively, than those of their ATP-bound forms (Garner et al., 2004; Polka et al., 2009).

      Response 2.5Bi. Absolutely correct. We apologize for the lack of accuracy of our phrasing and have corrected it (L123).

      ii. Likewise, (Mayer and Amann, 2009) have already demonstrated that B. subtilis MreB can polymerize in the presence of ADP, with a slightly higher critical concentration relative to the ATP-bound form.

      Response 2.5Bii. In Mayer and Amann, 2009, the same light scattering signal (interpreted as polymerization) occurred regardless of the nucleotide, and also in the absence of nucleotide (their Fig. 10), and ATP-, ADP- and AMP-PNP-MreB ‘displayed nearly indistinguishable critical concentrations’. They concluded that MreB polymerization is nucleotide-independent. Please see our detailed answer below (responses to ‘Other points to address’) to reviewer #2's recurring point about Mayer & Amann.

      Thus, to prove that MreB(Gs) polymers do not form in the presence of ADP would require testing a large concentration range of ADP-bound MreB(Gs). They should test whether ADP-MreB(Gs) polymerizes at the highest MreB(Gs) concentrations that can be assayed. Even if this fails, it may be that ADP-MreB(Gs) polymerizes only at concentrations higher than those achievable with their protein preparations (13 µM). An even simpler fix would be to state that MreB(Gs)-ADP filaments do not form below a given MreB(Gs) concentration.

      We agree with the reviewer. Our wording overstated our conclusions. Based on our new quantifications (Fig. 3-S1B, D), we have rephrased the Results section and now indicate that pairs of filaments are occasionally observed in the presence of ADP in our conditions across the range of MreB concentrations that could be tested, suggesting a higher critical concentration for MreB-ADP (L310-312). Sheet- and ribbon-like structures were observed in the presence of ADP only at the highest MreB concentration (Fig. 3-S2B).
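      The critical-concentration argument in point B.i can be made concrete with the textbook steady-state approximation, in which the free-monomer pool stays at the critical concentration (Cc) and all protein in excess of Cc is polymer. The sketch below assumes this idealized model (it ignores nucleation kinetics, bundling and aggregation), and the Cc values in it are hypothetical placeholders chosen only to mimic a 10-fold ATP/ADP difference, not measurements from this study.

```python
def polymer_at_steady_state(total_um: float, cc_um: float) -> float:
    """Polymer concentration (µM) in the idealized steady-state model.

    Below the critical concentration all protein remains monomeric;
    above it, the excess over Cc is incorporated into polymer.
    """
    return max(0.0, total_um - cc_um)


# Hypothetical critical concentrations (illustrative only): the ADP-bound
# form is given a 10-fold higher Cc than the ATP-bound form, as for actin.
CC_ATP_UM = 0.5
CC_ADP_UM = 5.0

for total in (0.25, 1.0, 4.0, 13.0):
    atp = polymer_at_steady_state(total, CC_ATP_UM)
    adp = polymer_at_steady_state(total, CC_ADP_UM)
    print(f"total {total:5.2f} µM -> ATP polymer {atp:5.2f} µM, ADP polymer {adp:5.2f} µM")
```

      In this toy model the ADP-bound protein forms polymer only once the total concentration exceeds its own, higher Cc, so a negative result for ADP is informative only up to the highest concentration actually tested.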

      Other Points to address:

      1) There are several points in this paper where the work by Mayer and Amann is ignored, not cited, or readily dismissed as "hampered by aggregation" without any explanation or supporting evidence of that fact.

      We have cited the Mayer study where appropriate. However, we cannot cite it as proof of polymerization in any given condition, since their approach does not show that polymers were obtained in their conditions. Again, they based all their conclusions solely on light scattering experiments, which cannot differentiate between polymers and aggregates.

      A) Lines 100-101 - While the irregular 3-D formations produced by MreB in the Dersch 2020 paper could be interpreted as aggregates, stating that the results specifically from the Gaballah and Mayer papers (and not others) were "hampered by aggregation" is currently an arbitrary statement, with no evidence or backing provided. Overall, these lines (and others in the paper) dismiss these two works without giving any evidence for that point. Thus, they should provide evidence for why they believe the results of all these papers reflect aggregation, or remove these (and other) dismissive statements.

      We apologize if our statements about these reports seemed dismissive or disrespectful; that was definitely not our intention. Light scattering shows an increase in particle size over time, but there is no way to tell whether the scattering is due to organized (polymers) or disorganized (aggregates) assemblies. Thus, it cannot be considered conclusive evidence of polymerization without proof, for example by EM, that true filaments are formed by the protein in the conditions tested. MreB is known to aggregate easily (see our size-exclusion chromatography profiles and those of Dersch et al., 2020; note that no chromatography profiles were shown in the Mayer report) and, as indicated above, we obtained similar light scattering results for MreB for years, while only aggregates could be observed by TEM (see above, Response 2.3A). Several observations also suggest that aggregation rather than polymerization might be at play in the Mayer study, for example ‘polymerization’ occurring in salt-free buffer but being ‘inhibited’ by as little as 100 mM KCl, which is more consistent with “salting in” (see below). We did not intend to be dismissive, but it seemed wrong to report their conclusions as conclusive evidence. We thought that we had cited these papers where appropriate and then explained why they show no conclusive proof of polymerization, but evidently we failed to communicate this clearly. We have reworked the text to remove any overstated or arbitrary statements about our concerns regarding these reports (e.g. L93 & L126).

      One important note - There are 2 points indicating that dismissing the Mayer and Amann work as aggregation is incorrect:

      1) the Mayer work on B. subtilis MreB shows both an ATP and a slightly higher ADP critical concentration. As the emergence of a critical concentration is a steady-state phenomenon arising from the association/dissociation of monomers (and a kinetically limiting nucleation barrier), an emergent critical concentration cannot arise from protein aggregation; critical concentrations only arise from a dynamic equilibrium between monomer and polymer.

      • Critical concentration for ATP, ADP or AMPPNP were described in Mayer & Amann (Mayer & Amann, 2009) as “nearly indistinguishable” (see Response 2.5Bii)
      • Protein aggregation depends on the solution (pH and ions), protein concentration and temperature. Above a certain concentration, proteins can become unstable, so a critical concentration for aggregation can also emerge.

      2) Furthermore, Mayer observed that increased salt slowed and reduced B. subtilis MreB light scattering, the opposite of what one would expect if their "polymerization signal" were only protein aggregation, as higher salt should increase the rate of aggregation by increasing the hydrophobic effect.

      It is true that at high salt concentrations proteins can precipitate, a phenomenon described as “salting out”. However, it is also true that salts help to solubilize proteins (“salting in”), and that proteins tend to precipitate in the absence of salt. Considering that the starting point of the Mayer and Amann experiment (Mayer & Amann, 2009) is the absence of salt (where they observed the highest scattering), and that the scattering gradually decreased with increasing KCl (it is almost abolished at only 100 mM!), it is plausible that a salting-in phenomenon is at play, due to the increased solubility of MreB in salt. In any case, this cannot be taken as proof that polymerization rather than aggregation occurred.
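      The critical-concentration argument in the reviewer's first point rests on the standard single-end elongation model. The relation below is a textbook simplification (it ignores nucleation and any difference between the two filament ends), and the symbols n (subunits per filament), [M] (free-monomer concentration), k+ (association rate constant) and k− (dissociation rate constant) are introduced here for illustration only.

```latex
\begin{aligned}
\frac{dn}{dt} &= k_{+}\,[M] - k_{-} \\
\frac{dn}{dt} = 0 \;&\Longleftrightarrow\; [M] = C_c = \frac{k_{-}}{k_{+}}
\end{aligned}
```

      Above C_c the net rate is positive and added protein goes into polymer; below it, filaments shrink. The derivation assumes reversible subunit exchange, and whether an observed concentration threshold reflects such an equilibrium is precisely the point under discussion here.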

      B) Lines 113-137 - The authors reference many different studies of MreB, including both MreB on membranes and MreB polymerized in solution (which formed bundles). However, they again neglect to mention or reference the findings of Mayer and Amann (Mayer and Amann, 2009), as this work was dismissed as "aggregation". As B. subtilis is also a Gram-positive organism, the Mayer results should be discussed.

      We did cite the Mayer and Amann paper but, as explained above, we cannot cite this study as an example of proven polymerization. We avoided polemics in the text as much as possible and cited this paper where we could. Again, we have reworked the text to avoid any overstated or dismissive statement. We had also forgotten to mention this study at L121 as an example of reported ATPase activity; this has now been corrected.

      2) Lines 387-391 state the rates of phosphate release relative to past MreB findings: "These rates of Pi release upon ATP hydrolysis (~ 1 Pi/MreB in 6 min at 53 °C) are comparable to those observed for MreBTm and MreB(Ec) in vitro". While Pi release AND ATP hydrolysis have indeed been measured for actin, this statement does not apply to MreB and should be corrected: all MreB papers thus far have measured Pi release alone, not ATP hydrolysis at the same time. Thus, it is inaccurate to state "rates of Pi release upon ATP hydrolysis" for any MreB study, because to accurately determine the rate of Pi release, one must measure 1) the amount of polymer over time, 2) the rate of ATP hydrolysis, and 3) the rate of phosphate release. For MreB, no one has so far even measured the rates of ATP hydrolysis and phosphate release in the same sample.

      We completely agree with the reviewer, we apologize if our formulation was inaccurate. We have corrected the sentence (L479). Thank you for pointing out this mistake.

      3) The interpretation of the interactions between monomers in the MreB crystal should be more carefully stated to avoid confusion. While likely not their intention, the discussions of the crystal packing contacts of MreB can appear to assume that the monomer-monomer contacts they see in crystals represent the contacts within actual protofilaments. One cannot automatically assume the observations of monomer-monomer contacts within a crystal reflect those that arise in the actual filament (or protofilament).

      We agree and thank the reviewer for these comments. We have revamped the corresponding paragraph.

      A) They state, "the apo form of MreBGs forms less stable protofilaments than its G- homologs." Given that filaments of apo MreB(Gs) or B. subtilis MreB have never been observed in solution, this statement is not accurate: while the contacts in the crystal may change with and without nucleotide, if the protein does not form polymers in solution in the apo state, then there are no "real" apo protofilaments, and any statements about their stability become moot. Thus, this statement should be rephrased or appropriately qualified.

      See above.

      B) Another example: in the apo MreB crystal, they observe that the loop of domain IB makes a single salt bridge with IIA and none with IIB. This contrasts with every actin, MreB, and actin homolog studied so far, where domain IB interacts with IIB. This might reflect the real contacts of MreB(Gs) in solution, or it may simply be a crystal-packing artifact. Thus, the authors should be careful in their claims, making it clear to the reader that the contacts in the crystal may not necessarily be present in polymerized filaments.

      Again, we agree with the reviewer, we cannot draw general conclusions about the interactions between monomers from the apo form. We have rephrased this paragraph.

      4) lines 201-202 - "Polymers were only observed at a concentration of MreB above 0.55 μM (0.02 mg/mL)". Given this concentration dependence of filament formation, which appears the same throughout the paper, the authors could state that 0.55 μM is the critical concentration of MreB on membranes under their buffer conditions. Given the lack of critical concentration measurement in most of the MreB literature, this could be an important point to make in the field.

      Following reviewer #2's suggestion, we have now estimated the critical concentration (Cc = 0.4485 µM) and reported it in the text (L218).
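      The response does not state how the Cc value was obtained. One common approach, shown here purely as an assumed illustration and not as the authors' procedure, fits the polymer signal against total protein concentration for points above the threshold and takes the x-intercept of the line as Cc. The function name is hypothetical, and the data are synthetic, constructed so that the intercept is 0.45 µM; they are not the measurements behind the reported value.

```python
def estimate_critical_concentration(totals_um, signals):
    """Return the x-intercept of an ordinary least-squares line.

    Only points above the polymerization threshold should be passed in,
    since below Cc the signal stays at baseline instead of following the line.
    """
    n = len(totals_um)
    if n < 2 or n != len(signals):
        raise ValueError("need at least two paired points")
    mean_x = sum(totals_um) / n
    mean_y = sum(signals) / n
    sxx = sum((x - mean_x) ** 2 for x in totals_um)
    if sxx == 0:
        raise ValueError("total concentrations must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(totals_um, signals))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Signal = slope * (total - Cc), so Cc is where the fitted line crosses zero.
    return -intercept / slope


# Synthetic example: signal = 2 * (total - 0.45) for totals above the threshold.
totals = [0.5, 1.0, 2.0, 4.0]
signals = [2 * (t - 0.45) for t in totals]
print(round(estimate_critical_concentration(totals, signals), 4))  # 0.45
```

      With real data, the choice of which points count as "above threshold" affects the estimate, so the fitted range is worth reporting alongside the value.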

      5) Both mg/ml and uM are used in the text and figures to refer to protein concentration. They should stick to one convention, preferably uM, as is standard in the polymer field.

      Sorry for the confusion. We have converted all MreB concentrations to µM throughout the text and figures.
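      For readers moving between the two conventions, the conversion is the mass concentration divided by the molar mass. A minimal sketch, assuming a molar mass of about 37 kDa for the MreBGs monomer (the approximate value quoted above for the monomeric size-exclusion peak); the function names are hypothetical. With that value, the 0.02 mg/mL quoted for lines 201-202 corresponds to roughly 0.54 µM, close to the 0.55 µM stated there.

```python
def mg_per_ml_to_micromolar(mass_conc_mg_per_ml: float, molar_mass_g_per_mol: float) -> float:
    """Convert a protein concentration from mg/mL to µM.

    1 mg/mL equals 1 g/L, so dividing by the molar mass (g/mol) gives mol/L,
    and multiplying by 1e6 gives µmol/L (µM).
    """
    return mass_conc_mg_per_ml / molar_mass_g_per_mol * 1e6


def micromolar_to_mg_per_ml(conc_um: float, molar_mass_g_per_mol: float) -> float:
    """Inverse conversion, from µM back to mg/mL."""
    return conc_um * 1e-6 * molar_mass_g_per_mol


# Assumed molar mass of the MreBGs monomer (~37 kDa, approximate).
MREB_MOLAR_MASS = 37_000.0

print(round(mg_per_ml_to_micromolar(0.02, MREB_MOLAR_MASS), 2))  # 0.54
```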

      6) Lines 77-78 - (Teeffelen et al., 2011) should be referenced as well in regard to cell wall synthesis driving MreB motion.

      This has been corrected, sorry for omitting this reference.

      7) Line 90 - "Do they exhibit turnover (treadmill) like actin filaments?". This phrase should be modified, as turnover and treadmilling are two very different things. Turnover is the lifetime of monomers in filaments, while treadmilling entails monomer addition at one end and loss at the other. While treadmilling filaments cause turnover, there are also numerous examples of non-treadmilling filaments undergoing turnover: microtubules, intermediate filaments, and ParM. Likewise, an antiparallel filament cannot directionally treadmill, as there is no difference between the two filament ends to confer directional polarity.

      This is absolutely true, we apologize for our mistake. The sentence has been corrected (L82).

      8) Throughout the paper, the term aggregation is used occasionally to describe the polymerization shown in many previous MreB studies, almost all of which very clearly showed "bundled" filaments, which are very distinct entities from aggregates, as a bundle of polymers cannot form without the filaments first polymerizing on their own. As evidence for this point, polymerization has been shown to precede the bundling of MreB(Tm) (Esue et al., 2005).

      We agree with reviewer #2 about polymers preceding bundles and “sheets”. However, we respectfully disagree that we used the word aggregation “throughout the paper” to describe structures that clearly showed polymers or sheets of filaments. A search (Ctrl-F: “aggreg”) reveals only 6 matches, 3 describing our own observations (L152, 163/5, and 1023/28), one referring to (Salje et al., 2011) (L107) but citing her claim that they observed aggregation (due to the N-terminus), and the last two (L100, L440) refer (again) to the Gaballah/Mayer/Dersch publications to say that aggregation could not be excluded in these reports as discussed above (Dersch et al., 2020; Gaballah et al., 2011; Mayer & Amann, 2009).

      9) Lines 106-108 mention that "The N-terminal amphipathic helix of E. coli MreB (MreBEc) was found to be necessary for membrane binding." This is not accurate, as Salje observed that a single helix could not cause MreB to bind to the membrane; rather, multiple amphipathic helices were required for membrane association (Salje et al., 2011).

      Salje et al. showed that in vivo, deletion of the helix abolishes the association of MreB with the membrane. This publication also shows that in vitro, adding the helix to GFP (not to MreB) prompts binding to lipid vesicles, and that binding increased with 2 copies of the helix, but they could not test this directly in vitro with MreB (which is insoluble when expressed with its N-terminus). This prompted them to speculate that multiple MreBs could bind the membrane better than monomers. However, this remained to be demonstrated. Additional hydrophobic regions of MreB, such as the hydrophobic loop, could participate in membrane anchoring but are absent from their in vitro assays with GFP.

      The Salje results imply that dimers (or further assemblies) of MreB drive membrane association, a point that should be discussed in regard to the question "What prompts the assembly of MreB on the inner leaflet of the cytoplasmic membrane?" posed on lines 86-87.

      We agree that this is an interesting point. As it is consistent with our results, we have incorporated it into our model (Fig. 6) and address it in the Discussion (L573-575).

      10) On lines 414-415, it is stated, "The requirement of the membrane for polymerization is consistent with the observation that MreB polymeric assemblies in vivo are membrane-associated only." While I agree with this hypothesis, it must be noted that the presence or absence of MreB polymers in the cytoplasm has not been directly tested, as short filaments in the cytoplasm would diffuse very quickly, requiring very short exposures (<5ms) to resolve them relative to their rate of diffusion. Thus, cytoplasmic polymers might still exist but have not been tested.

      This is also an interesting point. Indeed, if a nucleated form or very short (unbundled) polymers exist in the cytoplasm, their presence has not been tested by fluorescence microscopy. However, the polymers that localize at the membrane (~200 nm), if soluble, would have been detected in the cytoplasm in the work of reviewer #2, in ours, or in that of others.

      11) lines 429-431 state, "but polymerization in the presence of ADP was in most cases concluded from light scattering experiments alone, so the possibility that aggregation rather than ordered polymerization occurred in the process cannot be excluded."

      A) If an increased light scattering signal is initiated by the addition of ADP (or any nucleotide), that signal must come from polymerization or multimerization. What the authors imply is that there must be some ADP-dependent "aggregation" of MreB, which has not been seen thus far for any polymer. Furthermore, why would the addition of ADP initiate aggregation?

      We did not mean that ADP itself would prompt aggregation, but that the protein would aggregate in the buffer regardless of the presence of ADP or other nucleotides. The Mayer & Amann study claims that MreB “polymerization” is nucleotide-independent, as they got identical curves with ATP, ADP, AMPPNP and even with no nucleotides at all (Fig. 10 in their paper, pasted here) (Mayer & Amann, 2009).

      Their experiments with KCl are also remarkable: as they lowered the salt, they obtained faster and faster “polymerization”, with the strongest light scattering signal in the complete absence of salt. Already at 75 mM KCl they obtained almost no “polymers”, and ‘polymerization was almost entirely inhibited at 100 mM’ (Fig. 7, pasted below). Yet the intracellular concentration of KCl in bacteria is estimated to be ~300 mM (see Response 1.1).

      B) Likewise, the statement "Differences in the purity of the nucleotide stocks used in these studies could also explain some of the discrepancies" is unexplained and confusing. How could an impurity in a nucleotide stock affect the past MreB results, and what is the precedent for this claim?

      We meant that the presence of ATP in the ADP stocks might have affected the outcome of some assays, generating the conflicting results existing in the literature. We agree this sentence was confusing; we have removed it.

      12) lines 467-469 state, "Thus, for both MreB and actin, despite hydrolyzing ATP before and after polymerization, respectively, the ADP-Pi-MreB intermediate would be the long-lived intermediate state within the filaments."

      A) For MreB, this statement is extremely speculative and unsupported, as no one has measured 1) polymerization, 2) ATP hydrolysis, and 3) phosphate release. For example, it could be that ATP hydrolysis is slow, while phosphate release is fast, as is seen in the actin from Saccharomyces cerevisiae.

      We agree that this was too speculative. This has been removed from the (extensively) modified Discussion section. Thanks for the comment.

      B) For actin, the statement that ATP hydrolysis by monomers occurs "before polymerization" is functionally irrelevant, as the rate of ATP hydrolysis of actin monomers is 430,000 times slower than that of actin monomers inside filaments (Blanchoin and Pollard, 2002; Rould et al., 2006).

      We agree that the difference in hydrolysis rate between G-actin and F-actin implies that ATP hydrolysis occurs after polymerization. We are afraid that we do not follow the reviewer’s point here; we did not say or imply that ATP hydrolysis by actin monomers was functionally relevant.

      13) Lines 442-444. "On the basis of our data and the existing literature, we propose that the requirement for ATP (or GTP) hydrolysis for polymerization may be conserved for most MreBs." Again, this statement both here (and in the prior text) is an extremely bold claim, one that runs contrary to a large amount of past work on not just MreB, but also eukaryotic actin and every actin homolog studied so far. They come to this model based on 1) one piece of suggestive data (the behavior of MreB(GS) bound to 2 non-hydrolysable ATP analogs in 500 mM KCl), and 2) the dismissal (throughout the paper) of many peer-reviewed MreB papers that run counter to their model as "aggregation" or "contaminated ATP stocks". If they want to make this bold claim that their finding invalidates the work of many labs, they must back it up with further validating experiments.

      We respectfully disagree that our model was based on “one piece of suggestive data” and backed up by dismissing most past work in the field. We only wanted to raise awareness about the conflicting data between some reports (listed in Response 2.5a), and to point out that the claims made by some publications should be taken with caution because they rely only on light scattering or, when TEM was performed, showed only disorganized structures.

      That said, we clearly failed in presenting our model, and we are sorry to see that we really annoyed the reviewer with our suspicion that the work by Mayer & Amann reports aggregation. As indicated above, we have amended our manuscript on this point. We also agree that our suggestion to generalize our findings to most MreBs was unsupported, and overstated considering how confusing some results in the literature are. We have refined our model and reworked the text to take on board the reviewer’s remarks as well as the new data generated during the revision process.

      We would like to thank reviewer #2 for his in-depth review of our manuscript.  

      Reviewer #3 (Public Review):

      The major claim of the paper is that two factors determine the polymerization of MreB from a Gram-positive, thermophilic bacterium: 1) nucleotide hydrolysis, which drives the polymerization, and 2) the lipid bilayer, which acts as a facilitator/scaffold required for hydrolysis-dependent polymerization. These two conclusions contrast with what has been known until now for the MreB proteins that have been characterized in vitro. The experiments performed in the paper do not completely justify these claims, as elaborated below.

      We understand the reviewer’s concerns in view of the existing literature on actin and Gram-negative MreBs. We may simply be missing the optimal conditions for polymerization in solution, while our phrasing gave the impression that polymers could never form in the absence of ATP or lipids. Our new data actually show that MreBGs at higher concentration can assemble into bundle- and sheet-like structures in solution and in the presence of ADP/AMP-PNP. Pairs of filaments are, however, only observed in the presence of lipids for all conditions tested. As indicated in the answers to the global review comments, we have included our new data in the manuscript, revised our conclusions and claims about the lipid requirement, and expanded on these points in the Discussion.

      Major comments:

      1) The absence of filaments without a lipid monolayer could also be accounted for by a higher critical concentration of polymerization for MreBGS in that condition. All negative-staining experiments without a lipid monolayer were performed at a concentration of 0.05 mg/mL. It is important to check for polymerization of MreBGS at higher concentrations as well, in order to conclusively state that lipids are required for polymerization.

      Response 3.1. 0.05 mg/ml (1.3 µM) is our standard condition; our leeway was limited by the rapid aggregation observed at higher MreB concentrations, as indicated in the text. We have now also tested 0.25 mg/ml (6.5 µM), the maximum concentration possible before major aggregation occurs in our experimental conditions. At this higher concentration, we see some sheet-like structures in solution, confirming that MreB requires a higher concentration to polymerize under these conditions (see the answers to the global review comments for more details).
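      The two mass concentrations quoted above map onto the molar values through the monomer mass. A minimal Python sketch of the conversion follows; the ~38.5 kDa mass is an assumption back-calculated from the 0.05 mg/ml ≈ 1.3 µM and 0.25 mg/ml ≈ 6.5 µM pairs, not a value stated in this response, and the helper name is hypothetical.

```python
# Hypothetical helper for the concentrations quoted in Response 3.1.
# ASSUMPTION: a monomer mass of ~38.5 kDa, back-calculated from the quoted
# 0.05 mg/ml ~ 1.3 uM and 0.25 mg/ml ~ 6.5 uM pairs (not stated in the response).
ASSUMED_MASS_G_PER_MOL = 38_500.0


def mg_per_ml_to_um(conc_mg_per_ml, mass_g_per_mol=ASSUMED_MASS_G_PER_MOL):
    """Convert a protein concentration from mg/ml to micromolar."""
    grams_per_litre = conc_mg_per_ml  # 1 mg/ml equals 1 g/l
    molar = grams_per_litre / mass_g_per_mol  # mol/l
    return molar * 1e6  # mol/l -> umol/l


for conc in (0.05, 0.25):
    print(f"{conc} mg/ml ~ {mg_per_ml_to_um(conc):.1f} uM")
```

      Because the conversion is linear, the five-fold increase in mass concentration is also a five-fold increase in molar concentration, whatever the exact monomer mass.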

      We thank the reviewer for pushing us to address this point. We have revised our conclusions accordingly.

      2) The absence of filaments for the non-hydrolysable conditions in the lipid layer could also be because the filaments that might have formed are not binding to the planar lipid layer, and not necessarily because of their inability to polymerize.

      Response 3.2. This is a fair point. To test the possibility that polymers would form but would not bind to the lipid layer, we have now added semi-quantitative EM controls (for both the non-hydrolysable ATP analogs and the three ‘membrane binding’ deletion mutants) testing polymerization in solution (without lipids) and also using plasma-treated grids. These showed that in our standard polymerization conditions, virtually no polymers form in solution (Fig. 3-S1B and Fig. 4-S4A). Some double protofilaments were nevertheless detected, albeit at very low frequency, in the presence of ADP or AMP-PNP at the high MreB concentration (Fig. 3-S1D). At this high MreB concentration, the sheet-like structures occasionally observed in solution in the presence of ATP were frequent in the presence of ADP and very frequent in the presence of AMP-PNP (Fig. 3-S2B). We have revised our conclusions on the basis of these new data: MreBGs can form polymeric assemblies in solution and in the absence of ATP hydrolysis, but at a higher critical concentration than in the presence of ATP and lipids.

      See the answers to the global review comments (point 2) and Response 2.3C to reviewer #2 for more details.

      3) Given the ATPase activity measurements, it is not very convincing that ATP rather than ADP would be present in the structure. The ATP should have been hydrolysed to ADP within the structure. As it stands, the structure suggests that MreB is not capable of hydrolysis, which contradicts the ATP hydrolysis data.

      Response 3.3. We thank the reviewer for her insightful remarks about the MreB-ATP crystal structure. The electron density map clearly demonstrates the presence of 3 phosphates. However, as suggested by the reviewer, the density that was attributed to a Mg2+ ion should have been interpreted as a water molecule. The absence of Mg2+ in the crystal could thus explain why the ATP had not been hydrolyzed.

      References

      Arino J, Ramos J, Sychrova H (2010) Alkali metal cation transport and homeostasis in yeasts. Microbiology and molecular biology reviews 74: 95-120

      Bean GJ, Amann KJ (2008) Polymerization properties of the Thermotoga maritima actin MreB: roles of temperature, nucleotides, and ions. Biochemistry 47: 826-835

      Cayley S, Lewis BA, Guttman HJ, Record MT, Jr. (1991) Characterization of the cytoplasm of Escherichia coli K-12 as a function of external osmolarity. Implications for protein-DNA interactions in vivo. Journal of molecular biology 222: 281-300

      Dersch S, Reimold C, Stoll J, Breddermann H, Heimerl T, Defeu Soufo HJ, Graumann PL (2020) Polymerization of Bacillus subtilis MreB on a lipid membrane reveals lateral co-polymerization of MreB paralogs and strong effects of cations on filament formation. BMC Mol Cell Biol 21: 76

      Eisenstadt E (1972) Potassium content during growth and sporulation in Bacillus subtilis. Journal of bacteriology 112: 264-267

      Epstein W, Schultz SG (1965) Cation Transport in Escherichia coli: V. Regulation of cation content. J Gen Physiol 49: 221-234

      Esue O, Wirtz D, Tseng Y (2006) GTPase activity, structure, and mechanical properties of filaments assembled from bacterial cytoskeleton protein MreB. Journal of bacteriology 188: 968-976

      Gaballah A, Kloeckner A, Otten C, Sahl HG, Henrichfreise B (2011) Functional analysis of the cytoskeleton protein MreB from Chlamydophila pneumoniae. PloS one 6: e25129

      Harne S, Duret S, Pande V, Bapat M, Beven L, Gayathri P (2020) MreB5 Is a Determinant of Rod-to-Helical Transition in the Cell-Wall-less Bacterium Spiroplasma. Curr Biol 30: 4753-4762 e4757

      Kang H, Bradley MJ, McCullough BR, Pierre A, Grintsevich EE, Reisler E, De La Cruz EM (2012) Identification of cation-binding sites on actin that drive polymerization and modulate bending stiffness. Proceedings of the National Academy of Sciences of the United States of America 109: 16923-16927

      Lacabanne D, Wiegand T, Wili N, Kozlova MI, Cadalbert R, Klose D, Mulkidjanian AY, Meier BH, Bockmann A (2020) ATP Analogues for Structural Investigations: Case Studies of a DnaB Helicase and an ABC Transporter. Molecules 25

      Mannherz HG, Brehme H, Lamp U (1975) Depolymerisation of F-actin to G-actin and its repolymerisation in the presence of analogs of adenosine triphosphate. Eur J Biochem 60: 109-116

      Mayer JA, Amann KJ (2009) Assembly properties of the Bacillus subtilis actin, MreB. Cell motility and the cytoskeleton 66: 109-118

      Nurse P, Marians KJ (2013) Purification and characterization of Escherichia coli MreB protein. The Journal of biological chemistry 288: 3469-3475

      Pande V, Mitra N, Bagde SR, Srinivasan R, Gayathri P (2022) Filament organization of the bacterial actin MreB is dependent on the nucleotide state. The Journal of cell biology 221

      Peck ML, Herschlag D (2003) Adenosine 5′-O-(3-thio)triphosphate (ATP-gamma S) is a substrate for the nucleotide hydrolysis and RNA unwinding activities of eukaryotic translation initiation factor eIF4A. RNA 9: 1180-1187

      Popp D, Narita A, Maeda K, Fujisawa T, Ghoshdastider U, Iwasa M, Maeda Y, Robinson RC (2010) Filament structure, organization, and dynamics in MreB sheets. The Journal of biological chemistry 285: 15858-15865

      Rhoads DB, Waters FB, Epstein W (1976) Cation transport in Escherichia coli. VIII. Potassium transport mutants. J Gen Physiol 67: 325-341

      Rodriguez-Navarro A (2000) Potassium transport in fungi and plants. Biochimica et biophysica acta 1469: 1-30

      Salje J, van den Ent F, de Boer P, Lowe J (2011) Direct membrane binding by bacterial actin MreB. Molecular cell 43: 478-487

      Schmidt-Nielsen B (1975) Comparative physiology of cellular ion and volume regulation. J Exp Zool 194: 207-219

      Szatmari D, Sarkany P, Kocsis B, Nagy T, Miseta A, Barko S, Longauer B, Robinson RC, Nyitrai M (2020) Intracellular ion concentrations and cation-dependent remodelling of bacterial MreB assemblies. Sci Rep 10

      van den Ent F, Izore T, Bharat TA, Johnson CM, Lowe J (2014) Bacterial actin MreB forms antiparallel double filaments. eLife 3: e02634

      Whatmore AM, Chudek JA, Reed RH (1990) The Effects of Osmotic Upshock on the Intracellular Solute Pools of Bacillus subtilis. Journal of general microbiology 136: 2527-2535

    1. Author Response:

      Reviewer #2 (Public Review):

      This work uses a high-throughput continuous culture system with simplified soil microbial communities to investigate how diversity-disturbance relationships (DDRs) change with different disturbance "intensities" (here, defined as mortality rate or dilution rate in a continuous system) and "frequencies" (here, defined as the number of dilution events that occur per day to achieve the desired mortality rate). Understanding the mechanisms that support different DDRs is an ongoing and urgent need in ecology and ecosystem sciences because of the pressing need to predict and manage systems given climate and land-use disturbances.

      A major strength of the work is its blending of modeling and empirical approaches. It includes an ambitiously designed study that uses a controlled, high-throughput microbial community experimental system to observe disturbance outcomes and uses those observations to build the proposed quantitative framework. The figures are informative and the framework is explained clearly. The authors propose and name a new mechanism, "niche-flip", that describes resource competition at varying disturbance "intensities". This is an interesting proposal and I suggest that it be explored more fully as a potential mechanism (see weaknesses).

      Weaknesses of the work are the use of definitions that are generally inconsistent with the disturbance ecology literature, and the inability to separate the disturbance event characteristic of "intensity" from the biological outcome of mortality. The authors conclude that DDRs are contextual, which is supported by their modeling and data, but I suggest that they consider that diversity as an outcome in itself may not be the most informative metric of what mechanism(s) drive context-specific outcomes. The authors have a lot of compositional data that could also be examined to understand whether their "niche-flip" mechanism is supported.

      This work is likely to advance our understanding of the myriad outcomes of DDRs and of the potential mechanisms that may support those DDRs in natural ecosystems.

      Thank you for your kind words and careful review of our manuscript. We are pleased you appreciate both the experiments and the modeling work, and that you are intrigued by the findings and the niche flip mechanism.

      Major comments:

      Comment 1. Ecological definitions and interdependence of disturbance outcomes/attributes

      The authors define disturbance "intensity" as the average mortality rate but claim that this is a disturbance characteristic. However, mortality rate is not a characteristic of a disturbance event, but rather an effect/outcome of a disturbance on the biological community. The key distinction is that disturbance characteristics (also called traits or aspects) are defined relative to the environment, while disturbance outcomes (also called effects, impacts, or responses) are defined relative to the biology of interest, in this case a microbial community. So, changes in diversity of the community, as a result of a disturbance, is a biological outcome of the disturbance. An average mortality rate, what the authors call "intensity" (L40) would be such an outcome.

      Thank you for this excellent point. We have revised the introduction to make this distinction, reproduced here for convenience:

      "Accordingly, there have been many efforts aimed at understanding the role of environmental disturbances, which are perturbations to the state of an environment. These disturbances are of ecological interest for the impact they have on a community, for example, by bringing about mortality of organisms and a reduction of biomass of a community."

      The authors' definition of "intensity" is not in agreement with the disturbance ecology literature, including the references cited in this current work. For example, in reference #18 (Miller et al. 2011 PNAS) disturbance aspects include intensity, timing, duration, extent, and interval. Specifically, Miller et al. 2011 defined intensity as the magnitude of the disturbance (e.g., a flood's maximum stage). Notably, Miller's definition of intensity is more aligned with the authors' definition of "fluctuation," which the authors define as the "magnitude of deviations from the average". In the current work, the disturbance "event" cannot be separated from the biological outcome because of the nature of the continuous culture system. The system is not being disturbed with, for example, a change in pH or salinity or another environmental variable that results in microbial mortality, but rather by the loss of viable members from the community through control of the flow-through. So, the mortality is both the precisely controlled disturbance "event" and the "outcome" in the continuous culture.

      To summarize, the premise of the article is confusing, because one of the two disturbance "characteristics" considered is, rather a disturbance outcome. This may seem like mincing words and to each paper its own definitions, but because this work seeks to reconcile DDRs as reported across many studies, and because many of the previous ecology studies that have investigated or reported DDRs are not using analogous terms, the work could further confusion rather than serve as a reconciliation. When different definitions are applied that mix disturbance aspects with biological outcomes of disturbance, readers will have to work hard to understand this work in context with the existing literature. I suggest revising the introductory section to be consistent in terminology with the ecology literature and to be framed not only as disturbance characteristics, but also outcomes. I also suggest adding discussion of how an inability to distinguish disturbance event from outcome may influence interpretation of this work and its broader application. I suggest adding clarification/discussion of "how intensity and fluctuations interact" (e.g. L200): as the authors define intensity and fluctuation of the disturbance event, intensity is not independent of the biological disturbance outcome of mortality in the given model system. So, how the two "disturbance components interact" is not able to be examined independently from the biological outcome (mortality, resulting diversity).

      These are also critical points. First, we will address the choice of terminology (re: Miller et al) and, second, the equivalence between disturbance and outcome in continuous culture.

      We agree that careful use of terminology is important for understanding our work in context of the literature. Accordingly, we have replaced our characteristics “intensity” and “fluctuation” with “mean intensity” and “frequency” throughout the paper. We have also added more examples through the results section to indicate how mean intensity, frequency, and maximum dilution rates (during disturbance events) are related.

      "To determine whether the effects of disturbance on diversity are truly fluctuation-dependent15, a disturbance should ideally be decomposed into distinct components of mean intensity (e.g. time-averaged disturbance magnitude) and frequency (e.g. temporal profile of fluctuations)."

      The direct connection between disturbance and mortality in a continuous culture system under dilution disturbances is a critical aspect of our experimental design, because we wanted to compare disturbance outcomes that varied in temporal features (in Miller et al terms, intensity/magnitude vs frequency/timing) while holding mortality equal. In continuous culture this may be achieved by controlling dilution rate and frequency, but you are correct that other classes of natural disturbances such as pH or salinity changes may have different effects on community members. As a first step towards investigating these effects, we had included analyses with non-equal mortality rates (Appendix figure 4). We have now edited the introduction and discussion to emphasize that the equivalence between disturbance event and disturbance outcome is a feature specific to continuous culture.
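      One way to hold time-averaged mortality fixed while varying frequency is to scale the volume removed per pulse so that a day of pulses removes the same fraction as continuous dilution. The sketch below assumes exponential washout, so each of f pulses per day retains exp(-D/f) of the culture; this is an illustrative assumption, not necessarily the authors' exact protocol, and D = 0.3 per day is a hypothetical value.

```python
import math


def per_pulse_surviving_fraction(mean_rate_per_day, pulses_per_day):
    """Fraction of culture retained at each pulse (hypothetical helper).

    ASSUMPTION: f pulses per day are matched to continuous dilution at mean
    rate D, which retains exp(-D) of the culture over one day, so each pulse
    retains exp(-D / f).
    """
    return math.exp(-mean_rate_per_day / pulses_per_day)


D = 0.3  # hypothetical mean dilution rate (per day)
for f in (1, 4, 16):
    keep = per_pulse_surviving_fraction(D, f)
    daily = keep ** f  # retained after one day of pulses; equals exp(-D)
    print(f"{f:>2} pulses/day: keep {keep:.3f} per pulse, {daily:.3f} per day")
```

      Fewer, larger pulses remove more culture per event, so the maximum dilution rate during an event rises as frequency falls even though the daily loss, and hence the time-averaged mortality, is unchanged.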

      Introduction

      "Dilution is perhaps the most common choice for a laboratory disturbance, as it causes species-independent mortality and replenishes the system with fresh nutrients, reminiscent of flow in soil, aquatic, or gut microbiomes. Unlike disturbances with indirect biological impacts (such as pH, temperature, or osmolarity disturbances), there is a direct link between the dilution disturbance event (removal of culture volume) and the biological outcome (mortality of community members)."

      Discussion

      "We also note, however, that these types of disturbances do not share the direct link between environmental change and biological outcome that is characteristic of dilution disturbance, so the impact may be less clear."

      Comment 2: Compositional evidence for the proposed "niche flip" mechanism and suggestion for deeper consideration of population-level response to disturbance outcomes that collectively contribute to emergent diversity values.

      Regarding the "niche flip" - it is unclear whether there is compositional evidence for any swap in niche preference/space among particular community members. Figure S8 may offer evidence, but I could not deduce it from the busy bar charts. Could population/ASV level analysis be conducted on each member to assess their dynamics and ask whether the dynamics support the proposed niche-flip as a DDR mechanism?

      This is a very interesting suggestion. As suggested, we could extract the relative preferences of different ASVs from composition data to test a prediction about changes in composition resulting from niche flip. To make such a prediction, we would need the Monod growth parameters of the species on the relevant resources. We began collecting these data (see Figure 3 – figure supplement 4) but found it challenging to measure these parameters on defined media. Furthermore, since we elected to run our main experiments in a complex medium that could potentially support diverse communities (as opposed to minimal media, which produce simple communities; see Goldford et al Science 2018), we cannot link Monod growth parameters in this medium to particular resources. Subsequent experiments with defined species with measured Monod parameters in defined media would enable us to make and test predictions. These are sizeable experiments that we do not believe are in the scope of the present work. Without a testable prediction, we do not believe species- or ASV-level analysis to be particularly informative on its own.

      Related, there seems to be possible evidence of a "fluctuation" rate threshold, beyond which there is a major compositional shift in the microbial community. Consider Figure 3: at all "intensities", there is a shift in microbial community composition between "fluctuation" rates of 4/day and 16/day (3d, Fig S8). This threshold/shift is not apparent in the Shannon diversity in Fig 3f. This could be an example in which diversity in itself is not the most informative outcome of disturbance, as identical Shannon diversity values can result from different community compositions that are themselves the outcomes of different mechanisms. I see from the PCoAs (Fig S9) that the authors were exploring potential compositional clustering by day, frequency, and dilution; the most "obvious" clustering to the eye is indeed by "frequency", between 4/day and 16/day (red/blue separation along both axes), which also supports a potential threshold/shift. Generally, it would have been good to report statistical tests (e.g., PERMANOVA or equivalent) for these PCoA categories (where it makes sense, nested and term interactions as well). Is there statistical support for a compositional threshold shift between 4/day and 16/day?
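      The point that equal Shannon diversity can hide different compositions is easy to check numerically. In this hypothetical Python example, community B holds the same abundance values as community A reassigned to different ASVs: both have the same Shannon index, yet their Bray-Curtis dissimilarity is large. The abundances are invented for illustration, and the function names are hypothetical.

```python
import math


def shannon(abundances):
    """Shannon index H = -sum(p_i * ln p_i) over nonzero relative abundances."""
    total = sum(abundances)
    return -sum((a / total) * math.log(a / total) for a in abundances if a > 0)


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two abundance vectors (same ASV order)."""
    return sum(abs(a - b) for a, b in zip(x, y)) / sum(a + b for a, b in zip(x, y))


# Hypothetical relative abundances of four ASVs in two communities.
community_a = [0.5, 0.3, 0.2, 0.0]
community_b = [0.0, 0.2, 0.3, 0.5]

print(f"H(A) = {shannon(community_a):.3f}, H(B) = {shannon(community_b):.3f}")
print(f"Bray-Curtis(A, B) = {bray_curtis(community_a, community_b):.2f}")
```

      Shannon diversity depends only on the set of abundance values, not on which taxa hold them, so a compositional shift can leave it unchanged.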

      Thank you for these suggestions. Indeed, by eye and by the PCoA plots, there seems to be a significant difference in composition that separates the low-frequency (1/day & 4/day) from the high-frequency (16/day & Constant) conditions. We calculated pairwise distances between Day 6 samples grouped by A) dilution frequency, B) mean dilution rate, or C) combinations of dilution rate and frequency. Using these distances to perform PERMANOVA tests, we find significant differences between cultures with different frequencies, but not between cultures with different dilution rates. For combinations, we found several pairs with differences that were significant only before correction for false-discovery rate. Distances between low-frequency (1/day & 4/day) conditions are much smaller than between low-frequency and high-frequency groups, or between the high-frequency groups. We have now included this as Figure 3 – figure supplement 9 and have summarized the results in the main text, reproduced below for convenience:

      "PERMANOVA statistical analysis of endpoint compositions confirmed that dilution frequency (but not mean dilution rate) had a significant effect on composition (Figure 3 – figure supplement 9). Despite separation between conditions in PCoA of endpoint compositions (Figure 3 – figure supplement 9), PERMANOVA analysis of dilution rate and frequency combinations did not yield significant values after correcting for false discovery rate."
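      For readers less familiar with the test, the following is a minimal sketch of a one-way PERMANOVA on a precomputed distance matrix, followed by Benjamini-Hochberg adjustment of pairwise p-values. It uses the standard pseudo-F statistic with label permutations; it is not the authors' pipeline (the response does not state which software or distance measure was used), and the clustered toy data and raw p-values are synthetic.

```python
import numpy as np


def permanova(dist, labels, n_perm=999, seed=0):
    """One-way PERMANOVA on a square distance matrix.

    Returns (pseudo_F, p_value), with p = (1 + #permutations with F >= observed) / (1 + n_perm).
    """
    sq = np.asarray(dist, dtype=float) ** 2
    labels = np.asarray(labels)
    n = len(labels)
    groups = np.unique(labels)
    a = len(groups)
    # Total sum of squares: each pair counted once, divided by N.
    ss_total = sq[np.triu_indices(n, k=1)].sum() / n

    def pseudo_f(lab):
        ss_within = 0.0
        for g in groups:
            idx = np.flatnonzero(lab == g)
            sub = sq[np.ix_(idx, idx)]
            ss_within += sub[np.triu_indices(len(idx), k=1)].sum() / len(idx)
        ss_among = ss_total - ss_within
        return (ss_among / (a - 1)) / (ss_within / (n - a))

    f_obs = pseudo_f(labels)
    rng = np.random.default_rng(seed)
    hits = sum(int(pseudo_f(rng.permutation(labels)) >= f_obs) for _ in range(n_perm))
    return f_obs, (hits + 1) / (n_perm + 1)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (step-up false-discovery-rate control)."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downwards, then cap at 1.
    adjusted = np.minimum(np.minimum.accumulate(ranked[::-1])[::-1], 1.0)
    out = np.empty(m)
    out[order] = adjusted
    return out


# Synthetic example: two tight, well-separated clusters of six samples each.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0.0, 0.1, (6, 2)), rng.normal(3.0, 0.1, (6, 2))])
dist = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1))
labels = np.array(["low-frequency"] * 6 + ["high-frequency"] * 6)
F, p = permanova(dist, labels)
print(f"pseudo-F = {F:.1f}, p = {p:.3f}")

# Hypothetical raw p-values from four pairwise comparisons.
raw = [0.01, 0.04, 0.03, 0.2]
adjusted = benjamini_hochberg(raw)
print("BH-adjusted:", np.round(adjusted, 3))
```

      In the toy p-values, two comparisons fall below 0.05 before correction but only one remains below it afterwards, which is the kind of situation described for the rate-by-frequency combinations.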

      Reviewer #3 (Public Review):

      This manuscript focuses on the relationship between diversity and disturbance. The authors study this relationship in experimental microbial communities. These communities are subjected to different levels of disturbance, identified as the dilution rate. The authors find a non-monotonic relationship between diversity and dilution rate. In the presence of temporal fluctuations, the non-monotonic relationship becomes less evident, disappearing for strong enough fluctuations. The experimental findings are well explained by a consumer-resource model with Monod response.

      The results of the paper are a very interesting combination of experimental and theoretical work. The manuscript is well written and easy to follow.

      Experiments. The data support the main result of the paper. The U-shaped disturbance-diversity relationship (DDR) is robust (e.g., independent of the measure of diversity). The experimental setup is innovative.

      Theory. A main strength of the manuscript is the clarity with which the model reproduces the experimental data. It is also interesting that alternative models (Lotka-Volterra and consumer-resource with linear response) do not reproduce the data, therefore indicating the relevance of the data themselves. The main weakness of the paper is that, in the end, the mechanism behind the non-monotonicity of the DDR is not completely clear. The authors discuss how it emerges with two species and two resources in the presence of a trade-off between maximal growth rate and resource-limited growth rate: at low dilution rate, the species with high maximal growth rate wins, while at high dilution rate the one with high resource-limited growth rate dominates. This mechanism is clear with two species (in which diversity can transition between 2 and 1). It is unclear what happens with more species and resources. In particular, the role of the trade-off, which is central in the pairwise competition case, is unclear: the U-shaped relationship is also observed in the absence of the trade-off for multispecies communities.
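      The trade-off discussed above implies that the two species' Monod growth curves cross, so which species grows faster depends on the resource level, which in a chemostat is set by the dilution rate. The Python sketch below computes that crossover for two species on a single resource, from setting the two Monod rates equal; the parameter values and function names are hypothetical, and this is a simplification of, not a substitute for, the authors' multi-resource consumer-resource model.

```python
def monod(r, mu_max, k):
    """Monod growth rate mu_max * R / (K + R)."""
    return mu_max * r / (k + r)


def crossover_concentration(mu1, k1, mu2, k2):
    """Resource level R > 0 where two Monod curves intersect, or None.

    From mu1 * R / (k1 + R) = mu2 * R / (k2 + R):
    R = (mu2 * k1 - mu1 * k2) / (mu1 - mu2).
    """
    if mu1 == mu2:
        return None
    r_c = (mu2 * k1 - mu1 * k2) / (mu1 - mu2)
    return r_c if r_c > 0 else None


# Hypothetical parameters: species 1 has the higher maximal growth rate,
# species 2 the lower half-saturation constant (better when resource is scarce).
MU1, K1 = 1.0, 1.0
MU2, K2 = 0.5, 0.1

r_c = crossover_concentration(MU1, K1, MU2, K2)
print(f"growth curves cross at R = {r_c:.2f}")
for r in (0.1, r_c, 10.0):
    print(f"R = {r:5.2f}: species 1 {monod(r, MU1, K1):.3f}, species 2 {monod(r, MU2, K2):.3f}")
```

      Without such a crossing (for example, if one species had both the higher maximal rate and the lower half-saturation constant), the same species would grow faster at every resource level and the ranking could not flip.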

      Thank you for your enthusiasm about our work and your careful review of our manuscript. We are pleased you appreciate the concordance between experiment and model in our study.